NOTE: The content of this document is based on the course “R Programming for Statistics and Data Science”.

Statistics

Statistics refers to the discipline of mathematics and science that involves collecting, analyzing, interpreting, presenting, and organizing data. It provides methods and techniques for making sense of numerical information to draw conclusions, make predictions, and inform decision-making.

Statistics can be divided into two main branches: descriptive statistics and inferential statistics.

  1. Descriptive Statistics: These methods involve summarizing and describing the main features of a dataset. Common descriptive statistics include measures like mean (average), median (middle value), mode (most frequent value), range (difference between the highest and lowest values), and standard deviation (a measure of data dispersion).

  2. Inferential Statistics: This branch deals with making predictions, drawing conclusions, and making inferences about a larger population based on a smaller sample of data. Inferential statistics uses probability theory to assess the likelihood of various outcomes and to estimate parameters of interest.

Statistics is widely used in various fields such as economics, social sciences, business, medicine, engineering, and more. It plays a crucial role in designing studies, summarizing data, testing hypotheses, making forecasts, and supporting evidence-based decisions.

In the digital age, the availability of large data sets (big data) has further highlighted the importance of statistical techniques in extracting meaningful insights from vast amounts of information.

Population and Sample

  1. Population: The population refers to the entire group of individuals, objects, events, or data points that are of interest to a researcher or analyst. It’s the complete set of items you want to study or draw conclusions about. For example, if you were interested in studying the average height of all adults in a country, the entire adult population of that country would be the population.

  2. Sample: A sample is a subset of the population that is selected for the purpose of gathering data and making inferences about the population. Since it’s often impractical or impossible to collect data from an entire population, a sample is used as a representative subset. Properly selected samples can provide accurate information about the larger population without having to examine every individual within it. In the height example, you might select a few hundred individuals from different regions of the country to measure their heights, rather than measuring the heights of every single adult.

Sampling methods are used to select samples from populations. These methods can vary in terms of their randomness and representativeness, which affects the validity of the conclusions drawn from the sample data.

It’s important to note that the accuracy of statistical analyses and conclusions drawn from sample data depends on how well the sample represents the population. If the sample is chosen in a biased or non-random way, the results may not accurately reflect the population as a whole. To address this, statisticians use techniques like random sampling, stratified sampling, and other methods to ensure that the sample is as representative as possible.

In summary, a population is the entire group of interest, and a sample is a subset of that population used to draw conclusions and make inferences about the larger group.
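
To make the idea concrete, here is a small sketch in R (the population values and sample size are invented for illustration) showing how a simple random sample can be drawn from a hypothetical population with the sample() function:

# Hypothetical population: heights (in cm) of 10,000 adults
set.seed(123)
population_heights <- rnorm(10000, mean = 170, sd = 10)

# Draw a simple random sample of 300 individuals without replacement
sample_heights <- sample(population_heights, size = 300)

# The sample mean is used as an estimate of the population mean
mean(population_heights)
mean(sample_heights)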

Mean, median and mode

  1. Mean: The mean, also known as the average, is the sum of all values in a dataset divided by the number of values. It’s a measure of central tendency that gives you an idea of the “typical” value in the dataset.
# Example data
data <- c(10, 15, 20, 25, 30)

# Calculating the mean
mean_value <- mean(data)
mean_value
## [1] 20
  2. Median: The median is the middle value in a dataset when it’s arranged in ascending order. If there’s an odd number of values, the median is the middle value. If there’s an even number of values, the median is the average of the two middle values.
# Example data
data <- c(10, 15, 20, 25, 30, 35)

# Calculating the median
median_value <- median(data)
median_value
## [1] 22.5
  3. Mode: The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal) or multiple modes (multimodal), or it can have no mode if every value occurs the same number of times.
# Example data
data <- c(10, 15, 20, 25, 20, 30, 25)

# Calculating the mode
mode_value <- as.numeric(names(sort(table(data), decreasing = TRUE)[1]))
mode_value
## [1] 20

In the R code examples above, we’ve calculated the mean, median, and mode for different datasets. The mean() function calculates the mean and the median() function calculates the median; for the mode, we build a frequency table with table() and then pick out the value with the highest frequency.

Please note that the mode calculation in the R code above returns only a single value. If several values are tied for the highest frequency, the dataset is multimodal, and this one-liner reports just the first of them (a helper that returns all modes is sketched below).
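
As a complement, here is a minimal sketch of a small helper function (the name get_modes() is made up for this example) that returns every value tied for the highest frequency:

# Return all values that share the highest frequency
get_modes <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}

# Example: 20 and 25 both appear twice, so both are returned
get_modes(c(10, 15, 20, 25, 20, 30, 25))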

Skewness

Skewness is a statistical measure that describes the asymmetry of the probability distribution of a dataset. In simpler terms, it indicates whether the data is skewed to one side (left or right) or whether it is relatively symmetric.

There are three types of skewness:

  1. Positive Skewness (Right Skewness): The right tail of the distribution is longer or stretched out, and the bulk of the distribution is concentrated on the left side. In other words, the majority of the data values are smaller, and there are a few very large values.

  2. Negative Skewness (Left Skewness): The left tail of the distribution is longer or stretched out, and the bulk of the distribution is concentrated on the right side. In this case, the majority of the data values are larger, and there are a few very small values.

  3. No Skewness (Symmetrical Distribution): The distribution is symmetric, meaning that the left and right sides are roughly mirror images of each other.

Here’s how you can calculate skewness in R using the skewness() function from the e1071 package:

# Install and load the required package
# install.packages("e1071")
library(e1071)

# Example data with positive skewness
data_positive_skew <- c(10, 15, 20, 25, 30, 35, 100)

# Example data with negative skewness
data_negative_skew <- c(10, 75, 80, 85, 90, 95, 100)

# Example data with no skewness (symmetrical)
data_symmetrical <- c(10, 15, 20, 25, 30, 35)

# Calculate skewness for each dataset
skewness_positive <- skewness(data_positive_skew)
skewness_negative <- skewness(data_negative_skew)
skewness_symmetrical <- skewness(data_symmetrical)

# Print the skewness values
skewness_positive
## [1] 1.36023
skewness_negative
## [1] -1.36023
skewness_symmetrical
## [1] 0

In this example, we have three datasets: one with positive skewness, one with negative skewness, and one with no skewness (symmetrical). We calculate the skewness using the skewness() function and print the skewness values for each dataset. Positive skewness indicates a right-skewed distribution, negative skewness indicates a left-skewed distribution, and a skewness value near zero indicates a roughly symmetrical distribution.

Variance

Variance is a statistical measure that quantifies the spread or dispersion of a set of data points around their mean (average) value. It gives you an idea of how much the individual data points deviate from the mean. A higher variance indicates greater variability, while a lower variance indicates less variability.

Mathematically, the variance of a dataset is calculated as the average of the squared differences between each data point and the mean of the dataset. Here’s the formula for population variance:

Population Variance (σ²) = Σ (xi - μ)² / N

Where xi is the i-th data point, μ is the population mean, and N is the number of data points in the population.

In R, you can calculate the variance using the var() function. Here’s an example:

# Example data
data <- c(10, 15, 20, 25, 30)

# Calculate variance
variance <- var(data)
variance
## [1] 62.5

In this example, the var() function calculates the variance of the data vector. The result is the variance of the dataset, which provides a measure of how the data points are spread out from the mean.

Keep in mind that the formula above is for population variance. If you’re working with a sample of data and want to estimate the variance of the entire population, you use the sample variance formula, which divides by (N - 1) instead of N to correct for the bias introduced by using a sample. The var() function in R uses this sample variance formula (division by N - 1), as illustrated below.
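
To make the distinction concrete, here is a short sketch that computes the population variance manually for the same example vector and compares it with the sample variance returned by var():

data <- c(10, 15, 20, 25, 30)
n <- length(data)

# Population variance: divide the sum of squared deviations by N
population_variance <- sum((data - mean(data))^2) / n   # 50

# Sample variance: divide by N - 1, which is what var() returns
sample_variance <- sum((data - mean(data))^2) / (n - 1) # 62.5, same as var(data)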

Standard deviation

Standard deviation is a statistical measure that quantifies the amount of dispersion or variability in a set of data points. It’s closely related to variance and is often used as a more interpretable measure of how spread out the data is around the mean.

Mathematically, the standard deviation of a dataset is the square root of the variance. It provides a measure of the average distance between each data point and the mean of the dataset. A higher standard deviation indicates greater variability, while a lower standard deviation indicates less variability.

In R, you can calculate the standard deviation using the sd() function. Here’s an example:

# Example data
data <- c(10, 15, 20, 25, 30)

# Calculate standard deviation
standard_deviation <- sd(data)
standard_deviation
## [1] 7.905694

In this example, the sd() function calculates the standard deviation of the data vector. The result is the standard deviation of the dataset, which provides a measure of how the data points are spread out from the mean.

Here’s how the standard deviation is related to the variance:

Standard Deviation (σ) = √Variance

Both the variance and the standard deviation give you insights into the spread of the data, but the standard deviation is often preferred because it has the same unit of measurement as the original data, making it more intuitive to understand and interpret.
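
A quick sketch confirming this relationship on the example vector used above:

data <- c(10, 15, 20, 25, 30)

# The standard deviation is the square root of the (sample) variance
sqrt(var(data))  # 7.905694
sd(data)         # 7.905694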

Coefficient of variation

The coefficient of variation (CV) is a relative measure of variability that is used to compare the standard deviation of a dataset to its mean. It’s expressed as a percentage and is particularly useful when comparing the variability of different datasets that have different units of measurement or scales.

The formula for calculating the coefficient of variation is:

Coefficient of Variation (CV) = (Standard Deviation / Mean) * 100%

A higher CV indicates a larger relative spread of data points around the mean, while a lower CV indicates a smaller relative spread.

Here’s an example of how to calculate the coefficient of variation using R:

# Example data
data <- c(10, 15, 20, 25, 30)

# Calculate mean and standard deviation
mean_value <- mean(data)
standard_deviation <- sd(data)

# Calculate coefficient of variation
coefficient_of_variation <- (standard_deviation / mean_value) * 100
coefficient_of_variation
## [1] 39.52847

In this example, we calculate the mean and standard deviation of the data vector and then use those values to compute the coefficient of variation. The result is a percentage that represents the relative variability of the dataset compared to its mean.

The coefficient of variation is especially useful when comparing datasets with different units or scales. For instance, if you’re comparing the variability of the heights and weights of individuals, the coefficient of variation can provide a way to make meaningful comparisons even though the units are different.

Covariance and correlation

Covariance and correlation are both statistical measures that describe the relationship between two variables in a dataset. They help quantify how changes in one variable are associated with changes in another variable. However, they have different scales and interpretations.

  1. Covariance: Covariance measures the extent to which two variables vary together. If the covariance is positive, it indicates that when one variable is above its mean, the other variable tends to be above its mean as well. If the covariance is negative, it means that when one variable is above its mean, the other variable tends to be below its mean. Mathematically, the covariance between two variables X and Y is calculated as:

Cov(X, Y) = Σ [(xi - X̄) * (yi - Ȳ)] / (n - 1)

Where xi and yi are the individual data points, X̄ and Ȳ are the sample means of X and Y, and n is the number of paired observations.

  2. Correlation: Correlation is a standardized version of covariance. It measures the strength and direction of the linear relationship between two variables. The correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.

Mathematically, the correlation coefficient (Pearson correlation coefficient) between two variables X and Y is calculated as:

Correlation(X, Y) = Cov(X, Y) / (SD(X) * SD(Y))

Where SD(X) and SD(Y) are the standard deviations of X and Y respectively.

Here are examples of calculating covariance and correlation using R:

# Example data
x <- c(10, 15, 20, 25, 30)
y <- c(20, 18, 25, 28, 35)

# Calculate covariance
covariance <- cov(x, y)
covariance
## [1] 50
# Calculate correlation
correlation <- cor(x, y)
correlation
## [1] 0.9355605

In this example, we calculate the covariance and correlation between two variables x and y. The cov() function calculates the covariance and the cor() function calculates the correlation; when given two numeric vectors, each returns a single number, with cor() reporting the Pearson correlation coefficient between the two variables.
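
To connect the two formulas, here is a short sketch showing that the correlation can be reconstructed from the covariance and the two standard deviations:

x <- c(10, 15, 20, 25, 30)
y <- c(20, 18, 25, 28, 35)

# Covariance divided by the product of the standard deviations
cov(x, y) / (sd(x) * sd(y))  # 0.9355605, the same value returned by cor(x, y)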

Distributions

In statistics, a distribution refers to the way values are spread or distributed across a set of data. It describes the pattern of frequencies or probabilities of different outcomes or values in a dataset. Distributions are fundamental to understanding and analyzing data, as they provide insights into the central tendency, variability, and other characteristics of the data.

There are various types of distributions used in statistics. Some common ones include:

  1. Normal Distribution (Gaussian Distribution): The normal distribution is a symmetric, bell-shaped distribution characterized by its mean and standard deviation. Many natural phenomena and measurements tend to follow a normal distribution. Here’s an example of generating and visualizing a normal distribution in R:
# Generate random data from a normal distribution
data <- rnorm(1000, mean = 0, sd = 1)

# Plot a histogram to visualize the distribution
hist(data, breaks = 20, col = "skyblue", main = "Normal Distribution")

  2. Uniform Distribution: The uniform distribution is characterized by a constant probability of each value occurring within a specified range. Here’s an example of generating and visualizing a uniform distribution in R:
# Generate random data from a uniform distribution
data <- runif(1000, min = 0, max = 1)

# Plot a histogram to visualize the distribution
hist(data, breaks = 20, col = "lightgreen", main = "Uniform Distribution")

  3. Exponential Distribution: The exponential distribution models the time between events in a Poisson process (events happening at a constant average rate). Here’s an example of generating and visualizing an exponential distribution in R:
# Generate random data from an exponential distribution
data <- rexp(1000, rate = 0.5)

# Plot a histogram to visualize the distribution
hist(data, breaks = 20, col = "lightcoral", main = "Exponential Distribution")

These are just a few examples of the many types of distributions in statistics. Each distribution has its own properties and applications in different fields of study. R provides various functions and libraries to generate, analyze, and visualize different distributions, allowing you to explore their characteristics and use them in your analyses.

Standard error

The standard error is a measure of how much the sample mean (or other statistic) is likely to vary from the true population mean. It provides a way to quantify the uncertainty or variability associated with sample estimates. In other words, it gives an estimate of the precision of your sample statistic in representing the population parameter.

Mathematically, the standard error of the mean (SEM) is calculated as:

Standard Error of the Mean (SEM) = Standard Deviation / √Sample Size

Where the standard deviation is from the sample, and the sample size is the number of observations in the sample.

Here’s an example of calculating the standard error of the mean using R:

# Example data
data <- c(10, 15, 20, 25, 30)

# Calculate mean and standard deviation
mean_value <- mean(data)
standard_deviation <- sd(data)

# Calculate standard error of the mean
sample_size <- length(data)
standard_error <- standard_deviation / sqrt(sample_size)

standard_error
## [1] 3.535534

In this example, we calculate the mean and standard deviation of the data vector. Then, we use these values to calculate the standard error of the mean. The result gives you an idea of how much the sample mean is likely to vary from the true population mean.

The standard error is particularly useful when you’re making inferences about a population based on a sample. For example, when you’re calculating confidence intervals or performing hypothesis tests, the standard error helps you assess the accuracy and reliability of your sample-based estimates.

Confidence intervals

A confidence interval is a range of values that is used to estimate the true population parameter with a certain level of confidence. It provides a measure of the uncertainty associated with sample estimates and allows you to specify a range within which the true parameter value is likely to fall.

When you calculate a point estimate (e.g., sample mean, sample proportion), it’s just a single value, and it may not perfectly represent the true population parameter. Confidence intervals give you a range of plausible values around this point estimate.

Confidence intervals are typically expressed with a confidence level, which is a measure of how confident you are that the true parameter lies within the interval. For example, a 95% confidence interval means that if you were to repeat the sampling and estimation process many times, about 95% of the resulting intervals would contain the true population parameter.

Here’s an example of calculating a confidence interval for the mean using R:

# Example data
data <- c(10, 15, 20, 25, 30)

# Calculate mean and standard deviation
mean_value <- mean(data)
standard_deviation <- sd(data)

# Sample size
sample_size <- length(data)

# Confidence level
confidence_level <- 0.95

# Calculate standard error
standard_error <- standard_deviation / sqrt(sample_size)

# Calculate margin of error
margin_of_error <- qt((1 + confidence_level) / 2, df = sample_size - 1) * standard_error

# Calculate confidence interval
confidence_interval_lower <- mean_value - margin_of_error
confidence_interval_upper <- mean_value + margin_of_error

confidence_interval_lower
## [1] 10.18378
confidence_interval_upper
## [1] 29.81622

In this example, we calculate a 95% confidence interval for the mean of the data vector. We first calculate the standard error and then use the t-distribution to find the margin of error. Finally, we calculate the lower and upper bounds of the confidence interval using the point estimate (sample mean) and the margin of error.

The confidence interval gives you a range within which you can be reasonably confident that the true population mean lies. The wider the interval, the more uncertain the estimate, and the narrower the interval, the more confident you can be in the estimate.
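
The same interval can also be obtained directly from t.test(), which is a convenient cross-check of the manual calculation above:

data <- c(10, 15, 20, 25, 30)

# Extract the confidence interval from a one-sample t-test
t.test(data, conf.level = 0.95)$conf.int  # approximately 10.18 to 29.82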

Hypothesis testing

Hypothesis testing is a fundamental statistical method used to make decisions about population parameters based on sample data. It involves setting up a null hypothesis (H0) and an alternative hypothesis (Ha), and then using sample data to determine whether there’s enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

The process of hypothesis testing involves the following steps:

  1. Formulate Hypotheses: State the null hypothesis (H0), which typically represents “no effect” or “no difference,” and the alternative hypothesis (Ha), which represents the effect or difference you want to detect.

  2. Collect and Analyze Data: Gather sample data relevant to your hypothesis.

  3. Calculate Test Statistic: Calculate a test statistic that summarizes the relationship between the sample data and the null hypothesis.

  4. Determine Significance Level (Alpha): Choose a significance level (alpha) that represents the threshold for rejecting the null hypothesis. Common values are 0.05 or 0.01.

  5. Compare Test Statistic with Critical Value or P-value: Depending on the type of test, compare the test statistic with a critical value from a distribution or calculate the p-value. The p-value represents the probability of obtaining results as extreme as those observed, assuming the null hypothesis is true.

  6. Make a Decision: If the p-value is less than alpha (or the test statistic falls beyond the critical value), reject the null hypothesis in favor of the alternative; otherwise, fail to reject the null hypothesis.

Here’s an example of a one-sample t-test in R:

# Example data
data <- c(23, 25, 27, 22, 21, 26, 24, 29, 28, 30)

# Hypotheses
# H0: Mean = 25 (null hypothesis)
# Ha: Mean ≠ 25 (alternative hypothesis)

# Calculate test statistic and p-value
t_test <- t.test(data, mu = 25)

# Print the results
print(t_test)
## 
##  One Sample t-test
## 
## data:  data
## t = 0.52223, df = 9, p-value = 0.6141
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
##  23.33415 27.66585
## sample estimates:
## mean of x 
##      25.5

In this example, we perform a one-sample t-test to test whether the sample mean of the data vector is significantly different from 25. The t.test() function calculates the test statistic, the p-value, and provides information on whether the null hypothesis can be rejected.

Hypothesis testing is a crucial tool for making informed decisions in various fields by systematically evaluating evidence from sample data.

Type I and Type II errors

Type I and Type II errors are concepts related to hypothesis testing. They represent the incorrect decisions that can be made when conducting hypothesis tests.

  1. Type I Error (False Positive): A Type I error occurs when the null hypothesis (H0) is incorrectly rejected when it’s actually true. In other words, you conclude that there is an effect or a difference when there is no actual effect or difference in the population. Example in R:
# True population mean
population_mean <- 100

# Sample data
sample_data <- c(105, 102, 98, 101, 104)

# Hypotheses
# H0: Mean = 100 (null hypothesis)
# Ha: Mean ≠ 100 (alternative hypothesis)

# Conduct t-test
t_test <- t.test(sample_data, mu = 100)

# Type I error
if (t_test$p.value < 0.05) {
  cat("Type I error occurred: Null hypothesis was rejected incorrectly.\n")
} else {
  cat("No Type I error occurred: Null hypothesis was not rejected.\n")
}
## No Type I error occurred: Null hypothesis was not rejected.
  2. Type II Error (False Negative): A Type II error occurs when the null hypothesis (H0) is incorrectly not rejected when it’s actually false. In other words, you conclude that there is no effect or difference when there is an actual effect or difference in the population. Example in R:
# True population mean (assumed to differ from the hypothesized value of 100)
population_mean <- 105

# Sample data
sample_data <- c(98, 101, 99, 103, 102)

# Hypotheses
# H0: Mean = 100 (null hypothesis)
# Ha: Mean ≠ 100 (alternative hypothesis)

# Conduct t-test
t_test <- t.test(sample_data, mu = 100)

# Type II error
if (t_test$p.value >= 0.05) {
  cat("Type II error occurred: Null hypothesis was not rejected incorrectly.\n")
} else {
  cat("No Type II error occurred: Null hypothesis was rejected.\n")
}
## Type II error occurred: The false null hypothesis was not rejected.

In both examples, we perform a two-sided t-test with a significance level of 0.05, and the printed messages indicate whether a Type I or Type II error occurred based on the p-value. Note that the code can only label the errors because these are constructed examples in which the true population mean is known; in real applications the truth is unknown, so you can never tell whether a particular decision is an error.

Type I and Type II errors are both important considerations in hypothesis testing, as they represent the trade-off between making the correct decision and the risk of making an incorrect decision. The choice of significance level and sample size can influence the likelihood of these errors.
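
Because these error rates are defined over repeated sampling, a small simulation can make the trade-off concrete. This is only a sketch; the sample size, effect size, and number of replications are arbitrary choices for illustration:

set.seed(42)
alpha <- 0.05
n_reps <- 10000

# Estimated Type I error rate: the null hypothesis is true (true mean = 100)
p_under_h0 <- replicate(n_reps, t.test(rnorm(20, mean = 100, sd = 10), mu = 100)$p.value)
mean(p_under_h0 < alpha)   # should be close to the chosen alpha of 0.05

# Estimated Type II error rate: the null hypothesis is false (true mean = 105)
p_under_ha <- replicate(n_reps, t.test(rnorm(20, mean = 105, sd = 10), mu = 100)$p.value)
mean(p_under_ha >= alpha)  # proportion of tests that miss the real difference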

z-score and z-test

A z-score is a standardized value that measures the number of standard deviations a data point is away from the mean of a distribution. It’s often used in hypothesis testing to determine how extreme or unusual a data point is compared to the overall distribution.

A z-test is a hypothesis test that uses the z-score to assess whether a sample mean is significantly different from a population mean when the population standard deviation is known. It’s suitable for larger sample sizes and situations where the population standard deviation is available.

The formula to calculate the z-score is: Z-Score = (X - Mean) / Standard Deviation

Where “Z-Score” is the standardized score, “X” is the data point you want to standardize, “Mean” is the mean (average) of the data set, and “Standard Deviation” is the standard deviation of the data set.

Here’s an example of conducting a z-test in R:

# Example data
sample_mean <- 120
population_mean <- 100
population_sd <- 15
sample_size <- 30

# Calculate z-score
z_score <- (sample_mean - population_mean) / (population_sd / sqrt(sample_size))

# Significance level
alpha <- 0.05

# Calculate critical z-value for two-tailed test
critical_z <- qnorm(1 - alpha/2)

# Perform hypothesis test
if (abs(z_score) > critical_z) {
  cat("Reject null hypothesis: Sample mean is significantly different from population mean.\n")
} else {
  cat("Fail to reject null hypothesis: No significant difference between sample mean and population mean.\n")
}
## Reject null hypothesis: Sample mean is significantly different from population mean.

In this example, we’re testing whether a sample mean of 120 is significantly different from a population mean of 100, assuming a known population standard deviation of 15 and a sample size of 30. We calculate the z-score, determine the critical z-value for a two-tailed test at a 0.05 significance level using the qnorm() function, and then compare the calculated z-score with the critical value to make a hypothesis testing decision.

Remember that the z-test assumes that the population standard deviation is known. If the population standard deviation is not known, the t-test is a more appropriate choice.
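
Equivalently, you can express the decision as a two-sided p-value computed from the z-score with pnorm(); this sketch reuses the numbers from the example above:

# z-score from the example above
z_score <- (120 - 100) / (15 / sqrt(30))

# Two-sided p-value for the z-test
p_value <- 2 * (1 - pnorm(abs(z_score)))
p_value  # far below 0.05, so the null hypothesis is rejected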

z-score table (also known as a standard normal distribution table or z-table)

A z-score table (also known as a standard normal distribution table or z-table) is a table that provides values for the cumulative distribution function of the standard normal distribution. It’s used to find the area under the standard normal curve to the left of a given z-score.

The standard normal distribution is a specific normal distribution with a mean (μ) of 0 and a standard deviation (σ) of 1. Z-scores represent the number of standard deviations a data point is away from the mean in a standard normal distribution.

To use a z-score table, follow these steps:

  1. Convert the z-score to a Positive Value: Most printed tables list only positive z-scores. If you have a negative z-score, use the symmetry of the standard normal distribution: the area to the left of -z equals one minus the area to the left of +z. For example, for a z-score of -1.5, look up 1.5 in the table and subtract the table value from 1.

  2. Locate the Z-Score in the Table: Find the row that corresponds to the first digit(s) of the z-score and the column that corresponds to the second digit(s) of the z-score. The intersection of the row and column will provide the area under the standard normal curve to the left of that z-score.

  3. Interpolate if Necessary: If the z-score falls between the values in the table, you might need to interpolate to get a more accurate area.

  4. Calculate the Area Under the Curve: The value in the table represents the area under the standard normal curve to the left of the z-score. To find the area to the right of the z-score, subtract the table value from 1.

Here’s an example of how to use a z-score table:

Suppose you have a z-score of 1.75 and you want to find the area under the standard normal curve to the left of that z-score.

  1. Convert the z-score to a positive value: 1.75

  2. Locate the z-score in the table: Find the row labelled 1.7 and the column labelled 0.05; their intersection gives the area under the curve to the left of 1.75.

  3. Interpolate if necessary: If the table value is not exact, you can interpolate between neighboring values.

  4. Calculate the area: The table value corresponding to 1.75 is approximately 0.9599. So, the area to the left of the z-score is approximately 0.9599.

Remember that z-score tables are available in many statistics textbooks and online resources. These tables provide a quick way to find the cumulative probability associated with a specific z-score in a standard normal distribution.

In R, you can use the pnorm() function to calculate cumulative probabilities for the standard normal distribution. While there isn’t a built-in function that prints the entire z-score table like in textbooks, you can easily calculate specific values using pnorm().

The pnorm() function calculates the cumulative distribution function (CDF) for the standard normal distribution. Given a z-score, it provides the probability that a randomly selected value from a standard normal distribution is less than or equal to that z-score.

Here’s an example of how to use the pnorm() function to calculate cumulative probabilities (equivalent to the values in a z-score table):

# Calculate cumulative probability for a given z-score
z_score <- 1.75
cumulative_probability <- pnorm(z_score)

# Print the cumulative probability
cat("Cumulative Probability:", cumulative_probability, "\n")
## Cumulative Probability: 0.9599408
# Calculate probability to the right of the z-score (one minus the table value)
probability_to_right <- 1 - cumulative_probability
cat("Probability to the right:", probability_to_right, "\n")
## Probability to the right: 0.04005916

In this example, we use the pnorm() function to calculate the cumulative probability for a z-score of 1.75. The result is the probability that a value from a standard normal distribution is less than or equal to 1.75. We also calculate the probability to the right of the z-score, which is equivalent to subtracting the cumulative probability from 1.

While this approach doesn’t provide a full z-score table, it allows you to calculate specific cumulative probabilities for given z-scores, which is often more flexible and practical for statistical analysis in R.
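
Going in the other direction, qnorm() is the inverse of pnorm(): given a cumulative probability, it returns the corresponding z-score.

# z-score whose left-tail area is about 0.9599 (roughly the table entry for 1.75)
qnorm(0.9599)

# z-score that leaves 2.5% in each tail, used for 95% two-tailed tests
qnorm(0.975)  # approximately 1.96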

Hypothesis test for the mean when the population variance is known

Here’s an example of a hypothesis test for the mean when the population variance is known. We’ll use a one-sample z-test to determine if a sample mean is significantly different from a specified population mean.

Suppose we have the following scenario:

Scenario: A manufacturer claims that the average weight of their product is 500 grams. You collect a sample of 25 products and find the average weight to be 490 grams. The population standard deviation is known to be 20 grams. You want to test whether the sample provides enough evidence to reject the manufacturer’s claim.

Hypotheses:

  1. Null hypothesis (H0): The population mean weight is 500 grams (μ = 500).

  2. Alternative hypothesis (Ha): The population mean weight is not 500 grams (μ ≠ 500).

Here’s how you can perform the hypothesis test in R:

# Given data
sample_mean <- 490
population_mean <- 500
population_sd <- 20
sample_size <- 25

# Calculate z-score
z_score <- (sample_mean - population_mean) / (population_sd / sqrt(sample_size))

# Significance level
alpha <- 0.05

# Calculate critical z-values for two-tailed test
critical_z_lower <- qnorm(alpha / 2)
critical_z_upper <- qnorm(1 - alpha / 2)

# Perform hypothesis test
if (z_score < critical_z_lower || z_score > critical_z_upper) {
  cat("Reject null hypothesis: Sample mean is significantly different from population mean.\n")
} else {
  cat("Fail to reject null hypothesis: No significant difference between sample mean and population mean.\n")
}
## Reject null hypothesis: Sample mean is significantly different from population mean.

In this example, we calculate the z-score for the sample mean using the formula provided earlier. Then, we calculate the critical z-values for a two-tailed test at the 0.05 significance level using the qnorm() function. Finally, we compare the calculated z-score with the critical values to make a decision about whether to reject the null hypothesis.

Keep in mind that the z-test assumes the population variance is known. If the population variance is not known, a t-test should be used instead.

Hypothesis test for the mean when the population variance is unknown

Here’s an example of a hypothesis test for the mean when the population variance is unknown. We’ll use a one-sample t-test to determine if a sample mean is significantly different from a specified population mean.

Suppose we have the following scenario:

Scenario: A company claims that their new training program improves employee productivity, with an expected increase in productivity of 10 units. You collect a sample of 20 employees who underwent the training and find that the average increase in productivity is 8 units. The population standard deviation is not known. You want to test whether the sample provides enough evidence to support the company’s claim.

Hypotheses:

  1. Null hypothesis (H0): The mean increase in productivity is 10 units (μ = 10).

  2. Alternative hypothesis (Ha): The mean increase in productivity is not 10 units (μ ≠ 10).

Here’s how you can perform the hypothesis test in R:

# Given data
sample_mean <- 8
population_mean <- 10
sample_size <- 20
sample_sd <- 5  # Sample standard deviation

# Calculate t-score
t_score <- (sample_mean - population_mean) / (sample_sd / sqrt(sample_size))

# Degrees of freedom
df <- sample_size - 1

# Significance level
alpha <- 0.05

# Calculate critical t-values for two-tailed test
critical_t_lower <- qt(alpha / 2, df)
critical_t_upper <- qt(1 - alpha / 2, df)

# Perform hypothesis test
if (t_score < critical_t_lower || t_score > critical_t_upper) {
  cat("Reject null hypothesis: Sample mean is significantly different from population mean.\n")
} else {
  cat("Fail to reject null hypothesis: No significant difference between sample mean and population mean.\n")
}
## Fail to reject null hypothesis: No significant difference between sample mean and population mean.

In this example, we calculate the t-score for the sample mean using the formula provided earlier. We also calculate the degrees of freedom for the t-distribution. Then, we calculate the critical t-values for a two-tailed test at the 0.05 significance level using the qt() function. Finally, we compare the calculated t-score with the critical values to make a decision about whether to reject the null hypothesis.

Since the population variance is unknown, the sample standard deviation is used in the calculations. This example demonstrates how to perform a hypothesis test for the mean using a t-test when the population variance is not known.
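
As with the z-test, the decision can also be expressed as a p-value, here computed from the t-distribution with pt(); this sketch reuses the values from the example above:

t_score <- (8 - 10) / (5 / sqrt(20))
df <- 20 - 1

# Two-sided p-value for the one-sample t-test
p_value <- 2 * pt(-abs(t_score), df)
p_value  # roughly 0.09, which is above 0.05, so the null hypothesis is not rejected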

Comparing two means when samples are dependent

When comparing two means for dependent samples, you’re often dealing with situations where the same subjects or items are measured before and after a treatment, or when measurements are taken on paired observations. This type of analysis is typically done using a paired t-test.

Here’s an example scenario:

Scenario: A fitness instructor wants to test if a new workout program improves the average number of push-ups done by participants. She records the number of push-ups each participant can do before and after the program. You have a dataset with the paired observations for each participant.

Hypotheses:

  1. Null hypothesis (H0): The mean difference in push-ups before and after the program is zero.

  2. Alternative hypothesis (Ha): The mean difference in push-ups before and after the program is not zero.

Here’s how you can perform a paired t-test in R:

# Example data
before <- c(25, 28, 20, 22, 30, 18, 24, 27, 21, 23)
after <- c(28, 32, 22, 25, 32, 20, 28, 31, 24, 26)

# Perform paired t-test
paired_t_test <- t.test(before, after, paired = TRUE)

# Print the results
print(paired_t_test)
## 
##  Paired t-test
## 
## data:  before and after
## t = -11.619, df = 9, p-value = 1.013e-06
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -3.584086 -2.415914
## sample estimates:
## mean difference 
##              -3

In this example, we have the before and after vectors representing the number of push-ups done by participants before and after the program. We use the t.test() function with the paired = TRUE argument to perform a paired t-test. The result will include the t-statistic, degrees of freedom, p-value, and a confidence interval. You can use the p-value to make a decision about whether to reject the null hypothesis.

The paired t-test is appropriate when you have paired observations or measurements on the same subjects or items. It’s used to determine if there’s a statistically significant difference between the means of the paired observations.
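
A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which can serve as a sanity check; here is a sketch using the same vectors:

before <- c(25, 28, 20, 22, 30, 18, 24, 27, 21, 23)
after <- c(28, 32, 22, 25, 32, 20, 28, 31, 24, 26)

# One-sample t-test on the differences gives the same t, df, and p-value
t.test(before - after, mu = 0)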

Comparing two means when samples are independent

When comparing two means for independent samples, you’re dealing with situations where you have two separate groups of observations that are not related to each other. This type of analysis is typically done using an independent samples t-test.

Here’s an example scenario:

Scenario: You want to test if a new weight loss supplement has a significant effect on weight. You have two groups of participants: one group took the supplement, and the other group followed a placebo. You compare the weight measurements of the two groups at the end of the treatment period.

Hypotheses:

  1. Null hypothesis (H0): The mean weight of the supplement group equals the mean weight of the placebo group.

  2. Alternative hypothesis (Ha): The two group means are not equal.

Here’s how you can perform an independent samples t-test in R:

# Example data for the supplement group
supplement <- c(175, 170, 180, 185, 160, 165, 170, 175, 168, 173)

# Example data for the placebo group
placebo <- c(170, 165, 175, 180, 158, 163, 168, 173, 166, 171)

# Perform independent samples t-test
independent_t_test <- t.test(supplement, placebo)

# Print the results
print(independent_t_test)
## 
##  Welch Two Sample t-test
## 
## data:  supplement and placebo
## t = 1.0539, df = 17.7, p-value = 0.3061
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.18709  9.58709
## sample estimates:
## mean of x mean of y 
##     172.1     168.9

In this example, we have the supplement and placebo vectors representing weight measurements for the two groups. We use the t.test() function without the paired argument to perform an independent samples t-test. The result will include the t-statistic, degrees of freedom, p-value, and a confidence interval. You can use the p-value to make a decision about whether to reject the null hypothesis.

The independent samples t-test is appropriate when you have two separate groups and you want to compare the means of their observations. It’s used to determine if there’s a statistically significant difference between the means of the two groups.
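
Note that t.test() performs Welch’s t-test by default, which does not assume equal variances in the two groups (hence “Welch Two Sample t-test” in the output above). If you are willing to assume equal variances, you can request the classic pooled-variance (Student’s) test; a sketch with the same data:

supplement <- c(175, 170, 180, 185, 160, 165, 170, 175, 168, 173)
placebo <- c(170, 165, 175, 180, 158, 163, 168, 173, 166, 171)

# Pooled-variance two-sample t-test (assumes equal group variances)
t.test(supplement, placebo, var.equal = TRUE)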

Linear regression model

Linear regression is a statistical method used to model the relationship between a dependent variable (also called the response or outcome variable) and one or more independent variables (also called predictors or features). The goal of linear regression is to find the best-fitting linear equation that represents the relationship between the variables.

In its simplest form, linear regression assumes a linear relationship between the independent variable(s) and the dependent variable. The linear equation is represented as:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \]

Where:

  - \(Y\) represents the dependent variable.
  - \(X_1, X_2, \ldots, X_n\) are the independent variables.
  - \(\beta_0\) is the intercept (constant) term.
  - \(\beta_1, \beta_2, \ldots, \beta_n\) are the coefficients for each independent variable.
  - \(\epsilon\) represents the error term, accounting for unexplained variation in the model.

The main goal of linear regression is to estimate the coefficients (\(\beta\) values) that minimize the sum of squared differences between the observed dependent variable values and the predicted values generated by the linear equation.

Linear regression models can be used for various purposes, such as predicting the value of the dependent variable for new observations, quantifying the strength and direction of the effect of each predictor, and testing hypotheses about relationships between variables.

Linear regression can also be extended to handle multiple predictors, interactions between predictors, and nonlinear relationships through techniques like polynomial regression and transformation of variables.

In R, you can use the lm() function to fit linear regression models to data and obtain the estimated coefficients, residuals, and various statistical summaries.

Correlation and Regression

Correlation and regression are both statistical techniques used to analyze relationships between variables, but they serve different purposes and provide different types of information.

Correlation:

Correlation measures the strength and direction of the linear relationship between two continuous variables. It quantifies how changes in one variable are associated with changes in another variable. Correlation ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.

Key points about correlation:

  - It is symmetric: the correlation between X and Y is the same as between Y and X.
  - It only captures linear relationships; a correlation near zero does not rule out a strong nonlinear relationship.
  - It is unitless and always lies between -1 and 1.
  - Correlation does not imply causation.

Regression:

Regression aims to model the relationship between one or more independent variables and a dependent variable. It helps you understand how changes in the independent variables are associated with changes in the dependent variable. Regression provides a predictive model that estimates the dependent variable based on the values of the independent variables.

Key points about regression:

  - It distinguishes between a dependent variable and one or more independent variables, so the roles of the variables matter.
  - It produces an equation (intercept and slopes) that can be used to predict the dependent variable.
  - The coefficients quantify how much the dependent variable is expected to change for a one-unit change in a predictor.
  - Like correlation, it describes association rather than causation unless the data come from a controlled experiment.

In summary, correlation assesses the strength and direction of the linear relationship between variables, while regression aims to model and quantify the relationship between variables, enabling prediction and inference. Both techniques are valuable tools for exploring and analyzing relationships in data, but they address different aspects of the relationship between variables.
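
For simple linear regression the two ideas are directly linked: the estimated slope equals the correlation coefficient multiplied by the ratio of the standard deviations of Y and X. A short sketch with two illustrative vectors:

x <- c(2, 3, 5, 7, 9)
y <- c(5, 6, 8, 10, 12)

# Slope from lm() and the same slope reconstructed from the correlation
coef(lm(y ~ x))["x"]
cor(x, y) * sd(y) / sd(x)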

The geometrical representation of a simple linear regression model

The geometrical representation of a simple linear regression model involves visualizing the data points, the best-fitting regression line, and the residuals. The regression line is the line that minimizes the sum of squared residuals, which are the vertical distances between the data points and the regression line.

Here’s an example of creating a geometrical representation of a simple linear regression model in R:

# Example data
x <- c(2, 3, 5, 7, 9)
y <- c(5, 6, 8, 10, 12)

# Fit linear regression model
model <- lm(y ~ x)

# Plot the data points
plot(x, y, main = "Geometrical Representation of Regression", 
     xlab = "X", ylab = "Y", xlim = c(0, 12), ylim = c(0, 15))

# Add the regression line
abline(model, col = "blue")

# Add residuals as dashed lines
residuals <- resid(model)
segments(x, y, x, y - residuals, col = "red", lty = 2)

In this example, we fit the model with lm(), plot the observed data points, draw the fitted regression line with abline(), and add each residual as a red dashed vertical segment connecting the observed point to the fitted line.

The plot visually demonstrates how the regression line minimizes the sum of squared residuals. The residuals (red dashed lines) represent the vertical distances between the data points and the regression line. The closer the data points are to the regression line, the smaller the residuals.

Keep in mind that this example shows a simple linear regression with one independent variable. For multiple regression (more than one independent variable), the geometrical representation becomes more complex, as it involves higher-dimensional spaces.

Elements of simple linear regression

Simple linear regression involves modeling the relationship between two variables: a dependent variable (also called the response or outcome variable) and an independent variable (also called the predictor or feature variable). The goal is to find the best-fitting linear equation that describes the relationship between these variables. The key elements of a simple linear regression model are as follows:

  1. Dependent Variable (Y): The variable that you want to predict or explain. It’s the outcome or response variable that you’re trying to model based on the independent variable.

  2. Independent Variable (X): The variable that you believe has an influence on the dependent variable. It’s the predictor variable that you use to explain or predict changes in the dependent variable.

  3. Regression Equation: The linear equation that describes the relationship between the dependent and independent variables. It’s represented as: \[ Y = \beta_0 + \beta_1X + \epsilon \] where \(\beta_0\) is the intercept (constant), \(\beta_1\) is the coefficient for the independent variable, and \(\epsilon\) is the error term.

  4. Intercept (\(\beta_0\)): The point where the regression line crosses the y-axis (when \(X\) = 0). It represents the expected value of the dependent variable when the independent variable is zero.

  5. Slope (\(\beta_1\)): The change in the dependent variable for a one-unit change in the independent variable. It quantifies the rate of change in the dependent variable with respect to changes in the independent variable.

  6. Error Term (\(\epsilon\)): The unobserved random variability or noise that affects the dependent variable. It accounts for the differences between the observed data points and the predicted values from the regression equation.

  7. Residuals: The differences between the observed values of the dependent variable and the values predicted by the regression equation. Residuals are used to evaluate the goodness of fit of the model.

  8. Least Squares Criterion: The method used to estimate the coefficients (\(\beta\) values) that minimize the sum of squared differences between the observed dependent variable values and the predicted values generated by the regression equation.

  9. Assumptions: Simple linear regression relies on assumptions, including linearity (the relationship between variables is linear), independence of errors (residuals are not correlated), constant variance of residuals (homoscedasticity), and normally distributed residuals.

Simple linear regression is a foundational technique in statistics and forms the basis for more complex regression analyses. It provides insights into the relationship between two variables and can be used for prediction and inference.
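
R’s built-in diagnostic plots provide a quick visual check of several of these assumptions (linearity, constant variance, and normality of the residuals). A minimal sketch, assuming a model fitted with lm() on the built-in mtcars data:

# Fit a simple model and display the standard diagnostic plots
model <- lm(mpg ~ hp, data = mtcars)

par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid
plot(model)           # residuals vs fitted, Q-Q plot, scale-location, leverage
par(mfrow = c(1, 1))  # restore the default layout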

Linear regression example

Here’s an example of simple linear regression in R using a built-in dataset called “mtcars.” We’ll model the relationship between the miles per gallon (mpg) and the horsepower (hp) of cars:

# Load the dataset
data(mtcars)

# Perform simple linear regression
model <- lm(mpg ~ hp, data = mtcars)

# Print the summary of the regression model
summary(model)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
# Plot the data and the regression line
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per Gallon",
     main = "Simple Linear Regression: mpg ~ hp")
abline(model, col = "blue")

In this example, we regress mpg (miles per gallon) on hp (horsepower) using lm(), print the model summary, and then plot the observations together with the fitted regression line.

The output of summary(model) provides insights into the strength and significance of the relationship between the variables, as well as statistical details about the regression coefficients.

The plot visually displays the data points and the regression line, helping you understand how the linear regression model fits the data.
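
Once fitted, the model can also be used for prediction with predict(); the horsepower value below is an arbitrary example:

# Predicted miles per gallon for a hypothetical car with 150 horsepower
predict(model, newdata = data.frame(hp = 150))

# The same prediction with a 95% confidence interval for the mean response
predict(model, newdata = data.frame(hp = 150), interval = "confidence")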

How to interpret the regression table?

Interpreting the regression table from a linear regression analysis involves understanding the various coefficients, statistical measures, and their significance. Let’s go through the key components typically found in a regression table:

Coefficients:

  1. Intercept (Intercept/Constant): This represents the predicted value of the dependent variable when all independent variables are zero. In some contexts, it might not have a meaningful interpretation.
  2. Slope (Coefficient of the Independent Variable): This represents the change in the dependent variable for a one-unit change in the corresponding independent variable while holding other variables constant.
  3. Standard Error (SE): The standard deviation of the estimated coefficient. It indicates the variability of the estimated coefficient.

Statistical Measures:

  1. t-value (t): The ratio of the estimated coefficient to its standard error. It assesses the significance of the coefficient.
  2. p-value (Pr(>|t|)): The probability of observing a t-value at least as extreme as the one calculated, assuming the null hypothesis that the true coefficient is zero. A low p-value (typically below 0.05) suggests the coefficient is statistically significant.
  3. Degrees of Freedom (df): The degrees of freedom associated with the t-distribution. It’s usually the sample size minus the number of coefficients estimated (including the intercept).
  4. Residual Standard Error (RSE/Residuals’ Standard Deviation): The standard deviation of the residuals, representing the average amount by which the observed values deviate from the fitted values.
  5. Multiple R-squared (R-squared): A measure of the proportion of variance in the dependent variable explained by the independent variables in the model. It ranges from 0 to 1, where higher values indicate better fit.
  6. Adjusted R-squared (Adj R-squared): A modification of R-squared that penalizes adding insignificant predictors to the model. It’s useful when comparing models with different numbers of predictors.
  7. F-statistic (F): The ratio of the explained variance to the unexplained variance in the model. It tests the overall significance of the model. A low p-value suggests the model is significant.

Residuals and Diagnostic Measures: Residuals are the differences between observed and predicted values. Related diagnostic measures include:

  1. Studentized Residuals: Residuals scaled by an estimate of their standard deviation; unusually large values flag potential outliers.
  2. p-values of Residual Diagnostics: p-values from formal tests on the residuals, used to judge whether departures from the model assumptions are statistically significant.
  3. Durbin-Watson Statistic: Measures the presence of autocorrelation in the residuals. Values close to 2 suggest no autocorrelation.

Remember, interpreting regression results involves considering the context of your data and research question. Focus on the coefficients’ magnitudes, signs, and significance, as well as measures of model fit and residuals’ behavior.
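
Most of these quantities can also be extracted programmatically rather than read off the printed summary; a short sketch, assuming a fitted lm object called model:

model <- lm(mpg ~ hp, data = mtcars)

coef(summary(model))          # estimates, standard errors, t-values, and p-values
summary(model)$r.squared      # multiple R-squared
summary(model)$adj.r.squared  # adjusted R-squared
confint(model)                # 95% confidence intervals for the coefficients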

Interpretation of the simple linear regression example using the “mtcars” data set

Let’s interpret the results of the previous simple linear regression example using the “mtcars” dataset where we modeled the relationship between miles per gallon (mpg) and horsepower (hp) of cars.

Here’s the summary output we obtained from the regression model:

# Load the dataset
data(mtcars)

# Perform simple linear regression
model <- lm(mpg ~ hp, data = mtcars)

# Print the summary of the regression model
summary(model)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
# Plot the data and the regression line
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per Gallon",
     main = "Simple Linear Regression: mpg ~ hp")
abline(model, col = "blue")

Interpretation:

  1. Intercept (30.09886): When the horsepower (hp) is zero (which is not practically meaningful for this context), the estimated miles per gallon (mpg) is 30.09886.

  2. Horsepower Coefficient (-0.06823): For each unit increase in horsepower (hp), the estimated miles per gallon (mpg) decreases by 0.06823, holding other variables constant.

  3. Significance: Both the intercept and the horsepower coefficient are statistically significant. The p-values (Pr(>|t|)) are much smaller than the commonly used significance level of 0.05, indicating that these coefficients are likely not due to random chance.

  4. Residual Standard Error (3.863): The standard deviation of the residuals is approximately 3.863. This represents the average amount by which the observed miles per gallon values deviate from the fitted values.

  5. R-squared (0.6024) and Adjusted R-squared (0.5892): The R-squared value of 0.6024 indicates that about 60.24% of the variance in miles per gallon can be explained by the linear relationship with horsepower. The adjusted R-squared adjusts for the number of predictors and provides a slightly lower value, 58.92%.

  6. F-statistic (45.46) and p-value (1.788e-07): The F-statistic tests the overall significance of the model. The very small p-value indicates that the model as a whole is statistically significant.

In conclusion, based on this analysis, there is a statistically significant negative linear relationship between horsepower and miles per gallon. As horsepower increases, the estimated miles per gallon decreases. The model explains around 60.24% of the variability in miles per gallon. Keep in mind that this interpretation is based on the assumptions of linear regression and the context of the dataset.

Sum of squares total, sum of squares regression, and sum of squares error

In the context of linear regression analysis, the total sum of squares (SST), the sum of squares regression (SSR), and the sum of squares error (SSE) are essential components used to understand the variability in the data and assess the goodness of fit of the regression model.

Here’s what each of these terms represents:

  1. Total Sum of Squares (SST):

SST represents the total variability in the dependent variable. It measures the deviation of each observed Y value from the overall mean of Y, and it is computed as the sum of the squared differences between each observed Y value and the mean of Y.

Formula: \[SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2\]

  2. Sum of Squares Regression (SSR):

SSR represents the variability in the dependent variable that is explained by the regression model. It measures the deviation of the predicted values (based on the regression model) from the overall mean of Y, and it is computed as the sum of the squared differences between the predicted Y values and the overall mean of Y.

Formula: \[SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2\]

  3. Sum of Squares Error (SSE):

SSE represents the unexplained variability in the dependent variable. It measures the deviation of each observed Y value from its corresponding predicted value (based on the regression model), and it is computed as the sum of the squared differences between each observed Y value and its corresponding predicted Y value.

Formula: \[SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\]

In R, you can calculate these sums of squares as follows:

# Example data
observed_y <- c(5, 6, 8, 10, 12)
predicted_y <- c(4.8, 6.2, 7.9, 9.8, 11.9)  # Predicted values from a regression model

# Calculate the mean of observed Y
mean_y <- mean(observed_y)

# Calculate the Total Sum of Squares (SST)
sst <- sum((observed_y - mean_y)^2)

# Calculate the Sum of Squares Regression (SSR)
ssr <- sum((predicted_y - mean_y)^2)

# Calculate the Sum of Squares Error (SSE)
sse <- sum((observed_y - predicted_y)^2)

# Print the results
cat("Total Sum of Squares (SST):", sst, "\n")
## Total Sum of Squares (SST): 32.8
cat("Sum of Squares Regression (SSR):", ssr, "\n")
## Sum of Squares Regression (SSR): 31.9
cat("Sum of Squares Error (SSE):", sse, "\n")
## Sum of Squares Error (SSE): 0.14

In this example, we have a simple set of observed Y values and corresponding predicted Y values (based on a hypothetical regression model). We calculate SST, SSR, and SSE using the formulas described above. These values help us assess how much of the total variability in Y is explained by the model and how much remains unexplained (SSE). Note that the identity SST = SSR + SSE holds exactly only when the predictions come from a least-squares fit with an intercept; because the predicted values here are made up, SSR + SSE (32.04) only approximately equals SST (32.8).

Sum of squares total (SST), the sum of squares regression (SSR), and the sum of squares error (SSE) for the linear regression model performed earlier using the “mtcars” data set

Let’s calculate the sum of squares total (SST), the sum of squares regression (SSR), and the sum of squares error (SSE) for the linear regression model we performed earlier using the “mtcars” data set, where we modeled the relationship between miles per gallon (mpg) and horsepower (hp).

First, we’ll need to obtain the predicted values from the regression model. Then, we can calculate these sums of squares:

# Load the dataset
data(mtcars)

# Perform simple linear regression
model <- lm(mpg ~ hp, data = mtcars)

# Get observed and predicted values
observed_mpg <- mtcars$mpg
predicted_mpg <- predict(model)

# Calculate the mean of observed mpg
mean_mpg <- mean(observed_mpg)

# Calculate the Total Sum of Squares (SST)
sst <- sum((observed_mpg - mean_mpg)^2)

# Calculate the Sum of Squares Regression (SSR)
ssr <- sum((predicted_mpg - mean_mpg)^2)

# Calculate the Sum of Squares Error (SSE)
sse <- sum((observed_mpg - predicted_mpg)^2)

# Print the results
cat("Total Sum of Squares (SST):", sst, "\n")
## Total Sum of Squares (SST): 1126.047
cat("Sum of Squares Regression (SSR):", ssr, "\n")
## Sum of Squares Regression (SSR): 678.3729
cat("Sum of Squares Error (SSE):", sse, "\n")
## Sum of Squares Error (SSE): 447.6743

In this code, we refit the regression model, obtain the fitted values with predict(), compute the mean of the observed mpg values, and then apply the same three formulas as in the previous section. Note that SSR + SSE equals SST here (678.37 + 447.67 ≈ 1126.05), as expected for a least-squares fit with an intercept.

These calculations help us understand the partitioning of the total variability in “mpg” into the variability explained by the regression model (SSR) and the unexplained variability (SSE).
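
These sums of squares also tie directly to the R-squared reported by summary(): for a least-squares fit with an intercept, R² = SSR / SST = 1 - SSE / SST. A quick sketch continuing from the code above:

# R-squared recomputed from the sums of squares
ssr / sst        # approximately 0.6024
1 - sse / sst    # the same value

# Matches the value reported by the model summary
summary(model)$r.squared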

R-squared (R²)

R-squared (R²), also known as the coefficient of determination, is a statistical measure that quantifies the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. It is a value between 0 and 1, where 0 means the model explains none of the variance in the dependent variable and 1 means it explains all of it.

In the context of linear regression, R² is interpreted as the proportion of the variance in the dependent variable that is “captured” or “accounted for” by the regression model. It provides a measure of how well the model fits the data. Higher R² values indicate a better fit, meaning that a larger proportion of the variance in the dependent variable is explained by the model.

However, R² alone does not tell you whether the model is good or bad. It should be used in conjunction with other information, such as the context of the data and the significance of the coefficients. A high R² does not necessarily imply that the model is meaningful or that it has predictive power. Conversely, a low R² does not necessarily mean that the model is useless; it might just indicate that the relationship is more complex or that other important variables are missing from the model.

In the example of the linear regression model we performed on the “mtcars” dataset, here’s how we can interpret R²:

# Load the dataset
data(mtcars)

# Perform simple linear regression
model <- lm(mpg ~ hp, data = mtcars)

# Print the summary of the regression model
summary(model)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Multiple R-squared (R²): The value of 0.6024 indicates that approximately 60.24% of the variance in the “mpg” (miles per gallon) variable is explained by the linear relationship with “hp” (horsepower). In other words, about 60.24% of the variability in “mpg” can be accounted for by the linear regression model.

Adjusted R-squared (Adj R²): The adjusted R-squared value of 0.5892 adjusts for the number of predictors in the model. It’s slightly lower than the multiple R-squared and is used when comparing models with different numbers of predictors. It’s useful because it penalizes the inclusion of insignificant predictors.

In this case, the R² values suggest that the linear regression model with “hp” as the predictor explains a substantial portion of the variance in “mpg.” However, the interpretation of R² should be combined with other diagnostic measures and domain knowledge to assess the overall quality and appropriateness of the model.