As usual, we load the libraries we need, in this case tidyverse (remember to load ggplot2, tidyr, or readr for those students who already know they have to…).

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)

Data sets: you have two options. Our group chose to use tik_tok_dataset.

# bike_sharing <- read.csv("https://www.dropbox.com/scl/fi/qisz2k00d2ycwmzr7olca/bike_sharing.csv?rlkey=x4q2fn3j4q9y9yk5uz1wn1yxt&dl=1")
# head(bike_sharing)
tik_tok_dataset <- read.csv("https://www.dropbox.com/scl/fi/iedkt5ikxs250xamc8kzr/tiktok_dataset.csv?rlkey=2xp85u1qiqn5mxdicb6b41ocz&dl=1")
head(tik_tok_dataset)
##   X. claim_status   video_id video_duration_sec
## 1  1        claim 7017666017                 59
## 2  2        claim 4014381136                 32
## 3  3        claim 9859838091                 31
## 4  4        claim 1866847991                 25
## 5  5        claim 7105231098                 19
## 6  6        claim 8972200955                 35
##                                                                                                                    video_transcription_text
## 1                                         someone shared with me that drone deliveries are already happening and will become common by 2025
## 2                               someone shared with me that there are more microorganisms in one teaspoon of soil than people on the planet
## 3 someone shared with me that american industrialist andrew carnegie had a net worth of $475 million usd, worth over $300 billion usd today
## 4       someone shared with me that the metro of st. petersburg, with an average depth of hundred meters, is the deepest metro in the world
## 5          someone shared with me that the number of businesses allowing employees to bring pets to the workplace has grown by 6% worldwide
## 6           someone shared with me that gross domestic product (gdp) is the best financial indicator of a country's overall trade potential
##   verified_status author_ban_status video_view_count video_like_count
## 1    not verified      under review           343296            19425
## 2    not verified            active           140877            77355
## 3    not verified            active           902185            97690
## 4    not verified            active           437506           239954
## 5    not verified            active            56167            34987
## 6    not verified      under review           336647           175546
##   video_share_count video_download_count video_comment_count
## 1               241                    1                   0
## 2             19034                 1161                 684
## 3              2858                  833                 329
## 4             34812                 1234                 584
## 5              4110                  547                 152
## 6             62303                 4293                1857

Question 1: Monte Carlo simulation (video_comment_count)

# Set seed for reproducibility
set.seed(123)

# Select the 'video_comment_count' as variable for analysis
data_subset <- tik_tok_dataset$video_comment_count

# Create a histogram of the chosen variable
hist(data_subset, main = "Histogram of Video Comment Count", xlab = "Video Comment", col = "lightblue", border = "black")

# Set parameters for simulation
sample_size <- 30
num_samples <- 100000

# Initialize vectors to hold sample means and standard deviations
sample_means <- numeric(num_samples)
sample_std_devs <- numeric(num_samples)

# Run Monte Carlo simulation
for (i in 1:num_samples) {
    sample <- sample(data_subset, size = sample_size, replace = TRUE)
    sample_means[i] <- mean(sample)
    sample_std_devs[i] <- sd(sample)
}

# Plot the histogram of the sample means
hist(sample_means, main = "Distribution of Sample Means (Video Comment Count)", xlab = "Sample Mean", col = "lightblue", border = "black")

# Calculate the standard error of the sample means
standard_error_1 <- sd(sample_means)

# Calculate the estimated standard error using sample standard deviations
sample_std_errors <- sample_std_devs / sqrt(sample_size)
standard_error_2 <- mean(sample_std_errors)

# Display the results
cat("Standard Error from Sample Means:", standard_error_1, "\n")
## Standard Error from Sample Means: 145.8324
cat("Estimated Standard Error from Samples:", standard_error_2, "\n")
## Estimated Standard Error from Samples: 134.1781
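# Compute bounds of confidence intervals for each sample
# Note: 2.042 is the 97.5% t quantile for 30 degrees of freedom; with n = 30 the
# degrees of freedom are 29, so qt(0.975, df = 29) (about 2.045) is the exact value
population_mean <- mean(data_subset)
ci_lower_bound <- sample_means - 2.042 * sample_std_errors
ci_upper_bound <- sample_means + 2.042 * sample_std_errors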

# Check how often the population mean is within the confidence intervals
counts <- 100 * mean(ci_lower_bound < population_mean & population_mean < ci_upper_bound)
cat("The confidence interval contains the population mean", counts, "percent of the time\n")
## The confidence interval contains the population mean 86.525 percent of the time


ANSWER HERE BELOW For this question I used the TikTok dataset and ran the simulation on “video_comment_count”, a numeric variable that measures how much people comment on each video. First, across 100,000 random samples of size 30, the center of the histogram of sample means lines up with the population mean. This is what we expect, because the sample mean is an unbiased estimator of the population mean (Moore et al., 2017), and it shows that the simulation reproduces the central tendency of the TikTok engagement data.

For the variability, the histogram of the sample means is much narrower than the histogram of the original variable. The sample means cluster around the population mean, so the mean of 30 videos is a far more stable estimate than a single video. An even narrower histogram is expected for larger samples, since the sample mean becomes less variable as the sample size increases. The more concentrated the histogram, the more precise the estimate of the population mean, because the sample mean changes less from one sample to the next.

The standard deviation of the sample means is 145.8324, and the mean of the estimated standard errors is 134.1781. The two values are of similar size because both measure the variability of the sample mean, although the second is about 8% smaller. With 100,000 iterations the simulation noise is negligible, so this gap is systematic: the sample standard deviation tends to underestimate the population standard deviation, and the effect is stronger when the data are heavily right-skewed. This underestimation, together with the remaining skewness at n = 30, helps explain why the intervals computed above contain the population mean 86.525% of the time rather than the nominal 95%. The large number of iterations does mean that the empirical standard deviation of the sample means is close to the theoretical standard error (Moore et al., 2017).

The histogram of the sample means is roughly bell-shaped, which is consistent with the central limit theorem (CLT). The CLT says that as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, whatever the shape of the original population distribution, provided its variance is finite. The sample size is 30 in this simulation, and the histogram is already close to a bell shape.

This bell shape makes statistical inference more predictable, since we can use normal probability models to calculate confidence intervals and perform hypothesis tests (Rice, 2006). The shift from the skewed original distribution to a more normal distribution of sample means reflects the averaging effect of larger samples, which reduces the influence of extreme values and makes the analysis more robust. The histogram of the original variable video_comment_count is severely skewed to the right, whereas the histogram of the sample means is roughly bell-shaped because of this averaging. This contrast shows how effective the central limit theorem is: when the sample size is large enough, even highly skewed population distributions produce an approximately normal distribution of sample means.

In general, Monte Carlo simulation is an excellent way to assess the variability, normality, and consistency of sample statistics, even in data sets that are highly skewed. This capacity is necessary for data-driven decisions on platforms like TikTok, where engagement rates are prone to huge fluctuations. In this simulation, the statistical theory is put into practice and can be checked against the results.
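The comparison between the simulated and the theoretical standard error can be sketched on synthetic data. The block below is a minimal illustration, not the TikTok analysis itself: the exponential `population` and its scale of 350 are assumptions chosen only to mimic a right-skewed count variable, and the matrix-based sampling is one possible alternative to the `for` loop used above.

```r
# Sketch on synthetic data (not the TikTok file): a right-skewed "population"
set.seed(42)
population <- rexp(20000, rate = 1 / 350)   # hypothetical comment counts

sample_size <- 30
num_samples <- 20000

# Each column of the matrix is one sample of size 30 drawn with replacement
samples <- matrix(sample(population, sample_size * num_samples, replace = TRUE),
                  nrow = sample_size)
sample_means <- colMeans(samples)

# Population SD (divide by N, because `population` is the whole population here)
population_sd <- sqrt(mean((population - mean(population))^2))

empirical_se   <- sd(sample_means)                   # spread of the simulated means
theoretical_se <- population_sd / sqrt(sample_size)  # sigma / sqrt(n)

cat("Empirical SE:  ", empirical_se, "\n")
cat("Theoretical SE:", theoretical_se, "\n")
```

If the two printed values agree closely, the simulation is behaving as the theory predicts; a large gap would point to a coding error rather than to sampling noise.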

References:

Moore, D. S., McCabe, G. P., & Craig, B. A. (2017). Introduction to the practice of statistics (9th ed.). W. H. Freeman.

Rice, J. A. (2006). Mathematical statistics and data analysis (3rd ed.). Cengage Learning.

Question 2: Standard error. (video_view_count)

# Select the 'video_view_count' as variable for analysis
data_subset <- tik_tok_dataset$video_view_count

# Simulation Setup
sample_size <- 30 
iterations <- 100000

# Initialize vectors to store sample means and standard deviations
sample_means <- numeric(iterations)
sample_sds <- numeric(iterations)
# Run through the iterations and store the mean and standard deviation for each sample
for (i in 1:iterations) {
    sample <- sample(data_subset, size = sample_size, replace = TRUE)
    sample_means[i] <- mean(sample)
    sample_sds[i] <- sd(sample)
}
# Plot the histogram of the sample means
hist(sample_means, main = "Distribution of Sample Means (Video View Count)", xlab = "Sample Mean", col = "red", border = "green")

# Calculate the standard error of the sample means
standard_error_1 <- sd(sample_means)
cat("Standard error of the sample means:", standard_error_1, "\n")
## Standard error of the sample means: 59030.43
# Calculate the estimated standard errors from the individual sample standard deviations
sample_std_errors <- sample_sds / sqrt(sample_size)
cat("First few estimated standard errors:", head(sample_std_errors), "\n")
## First few estimated standard errors: 56728.81 58420.12 66121.36 58002.21 57678.03 51709.27
# Calculate the mean of these estimated standard errors
standard_error_2 <- mean(sample_std_errors)
cat("Mean of estimated standard errors:", standard_error_2, "\n")
## Mean of estimated standard errors: 58547.1

ANSWER HERE BELOW In this analysis, I focus on understanding the distribution and variability of video view counts on TikTok. I start by loading the TikTok dataset and selecting the video_view_count column for analysis. This column provides the data needed to determine how video views are spread across the platform. I create 100,000 samples, each consisting of 30 video view counts randomly selected with replacement. This simulates the range of samples that could be drawn from the data. For each sample, both the mean and the standard deviation are calculated and stored.

I then plot a histogram of the sample means, offering a visual representation of how the average view counts are distributed across all sampled sets. This histogram helps in visualizing the central tendency and the variability of the sample means. Next, I calculate the standard error of the sample means, which is 59030.43. This value shows how much the sample means typically vary around the population mean.

Furthermore, I estimate a standard error for each sample from its own standard deviation. The first few values are 56728.81, 58420.12, 66121.36, 58002.21, 57678.03, and 51709.27. Each estimate is what a researcher with only that one sample would report as the standard error. The mean of these estimated standard errors is 58547.1. This gives a measure of how these estimates average out across all samples.

The calculated standard error of the sample means and the mean of the estimated standard errors both describe the precision of the average view counts derived from sampling. These measures are crucial for understanding how well a sample mean represents the full set of TikTok videos. The two types of standard error agree to within about 1%, which suggests that the standard error estimated from a single sample is a reasonable guide to the true variability of the sample mean. This kind of check is fundamental in statistical practice before using sample means to describe trends in social media metrics such as viewer engagement on TikTok.
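The relationship between the two quantities can be written out explicitly. The notation below is introduced here for illustration and is not part of the original code: $\sigma$ is the population standard deviation, $n$ the sample size, and $s_i$ the standard deviation of the $i$-th sample.

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}},
\qquad
\widehat{\mathrm{SE}}_i = \frac{s_i}{\sqrt{n}},
\qquad
\mathbb{E}[s_i] \;\le\; \sqrt{\mathbb{E}[s_i^2]} \;=\; \sigma .
```

The first expression is what `sd(sample_means)` approximates, and the second is what each entry of `sample_std_errors` estimates. The inequality (Jensen's inequality applied to the square root) is one reason the mean of the estimated standard errors comes out slightly below the standard deviation of the sample means, as it does in the output above (58547.1 versus 59030.43).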

Question 3: Confidence Interval and T-Test. (video_view_count)

# Use 'video_view_count' for this analysis
data_subset <- tik_tok_dataset$video_view_count

# Calculate the population mean
population_mean <- mean(data_subset, na.rm = TRUE)

# Parameters for confidence interval analysis
sample_size <- 30
iterations <- 100000
# Initialize vectors for sample means, standard deviations, and confidence interval checks
sample_means <- numeric(iterations)
sample_sds <- numeric(iterations)
mean_in_ci <- numeric(iterations)
# Generate Samples and Calculate CIs
for (i in 1:iterations) {
  # Sample data with replacement
  sample <- sample(data_subset, size = sample_size, replace = TRUE)
  
  # Calculate sample mean and standard deviation
  sample_means[i] <- mean(sample)
  sample_sds[i] <- sd(sample)
  
  # Calculate standard error
  standard_error <- sample_sds[i] / sqrt(sample_size)
  
  # Calculate 95% confidence interval bounds
  lower_bound <- sample_means[i] - 1.96 * standard_error
  upper_bound <- sample_means[i] + 1.96 * standard_error
  
  # Check if population mean falls within the CI
  mean_in_ci[i] <- population_mean >= lower_bound && population_mean <= upper_bound
}
# Calculate Percentage of CIs Containing the Population Mean
percentage_in_ci <- mean(mean_in_ci) * 100

# Output results
cat("Population Mean (video_view_count): ", population_mean, "\n")
## Population Mean (video_view_count):  254708.6
cat("Percentage of 95% CIs containing the population mean: ", percentage_in_ci, "%\n")
## Percentage of 95% CIs containing the population mean:  93.217 %

ANSWER HERE BELOW In Question 3, the focus is on calculating confidence intervals (CIs) and conducting a T-test for the “video_view_count” data from a TikTok dataset. The process starts by using the specific data column “video_view_count” for analysis. This data is then used to perform statistical sampling and inferential statistics to understand the variability and confidence we can have about the population mean based on sample data.

First, I load the TikTok dataset and extract the video_view_count column. This column provides the view count for each video. I then calculate the mean of these view counts, which represents the average number of views across all videos in the dataset. This mean is important because it serves as the population mean against which the sampled data are compared. Next, I set up the parameters for the confidence interval analysis, namely the sample size and the number of iterations. I choose a sample size of 30 and repeat the sampling process 100,000 times so that the results are stable. Vectors are initialized to store the sample means, the sample standard deviations, and the result of each sample’s confidence interval check.

In the sampling process, for each of the 100,000 iterations, I randomly select 30 view counts with replacement, calculate their mean and standard deviation, and store these values. This repeated sampling simulates the range of possible outcomes and provides a distribution of sample means and standard deviations. For each sample, I calculate the standard error, which estimates how much the sample mean varies around the population mean. Using this standard error, I then calculate a 95% confidence interval around each sample mean. Across repeated samples, about 95% of intervals built this way are expected to contain the population mean. After calculating the confidence intervals, I check whether the population mean falls within the interval for each sample. This check is closely related to the t-test: an interval that misses the population mean corresponds to a two-sided test rejecting the true mean at the 5% level, although the code uses the normal critical value 1.96 rather than the t critical value for 29 degrees of freedom.

Finally, I compute the percentage of times the population mean falls within the 95% confidence intervals across all samples. This percentage indicates how well the interval procedure works and how reliably the sample means represent the true population mean. The results show that the population mean falls within the 95% confidence intervals in 93.217% of the cases, which is close to, though slightly below, the nominal 95%. This finding suggests that our sampling method is fairly reliable, and the sample means are a good representation of the population mean. This analysis helps validate the use of such statistical methods in estimating population parameters from sample data, particularly in applications like social media analytics where direct measurement of entire populations is impractical.
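One way to see how much the choice of critical value matters is to compute the coverage twice on the same samples. The sketch below is an illustration under stated assumptions: it uses a synthetic exponential `population` (scale 250,000, chosen only to resemble skewed view counts) rather than the TikTok file, and the helper `coverage()` is a hypothetical name introduced here.

```r
# Sketch on synthetic data: coverage of 95% intervals with z versus t critical values
set.seed(2024)
population <- rexp(20000, rate = 1 / 250000)  # hypothetical skewed view counts
population_mean <- mean(population)

sample_size <- 30
iterations <- 20000

# Each column is one sample of size 30 drawn with replacement
samples <- matrix(sample(population, sample_size * iterations, replace = TRUE),
                  nrow = sample_size)
sample_means <- colMeans(samples)
sample_ses <- apply(samples, 2, sd) / sqrt(sample_size)

# Percentage of intervals (mean +/- critical_value * SE) containing the population mean
coverage <- function(critical_value) {
  lower <- sample_means - critical_value * sample_ses
  upper <- sample_means + critical_value * sample_ses
  100 * mean(lower <= population_mean & population_mean <= upper)
}

z_crit <- qnorm(0.975)                     # about 1.96
t_crit <- qt(0.975, df = sample_size - 1)  # about 2.045 for n = 30

cat("Coverage with z critical value:", coverage(z_crit), "%\n")
cat("Coverage with t critical value:", coverage(t_crit), "%\n")
```

Because the t interval is wider, its coverage can only be equal or higher. With skewed data and n = 30, either version may still fall short of 95%, which would be consistent with the shortfall reported above.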

Question 4: Check the Central Limit Theorem. (video_comment_count)

# Use 'video_comment_count' for this analysis
data_subset <- tik_tok_dataset$video_comment_count

# Set sample size and number of samples for CLT
sample_size <- 2
num_samples <- 10000
# Generate samples and calculate means
sample_means <- numeric(num_samples)

for (i in 1:num_samples) {
    sample <- sample(data_subset, size = sample_size, replace = TRUE)
    sample_means[i] <- mean(sample)
}
# Plot the histogram of sample means
hist(sample_means, main = "Distribution of Sample Means (Video Comment Count)", xlab = "Sample Mean", col = "lightblue", border = "black")

# Set the sample sizes to compare
sample_sizes <- c(1, 2, 3, 10, 30, 100)
# Repeat the Central Limit Theorem demonstration for each sample size and plot the
# histograms in a 2x2 grid; with six sample sizes, the plots fill two pages of the grid

par(mfrow = c(2, 2))
for (size in sample_sizes) {
    sample_means <- numeric(num_samples)
    for (i in 1:num_samples) {
        sample <- sample(data_subset, size = size, replace = TRUE)
        sample_means[i] <- mean(sample)
    }
    hist(sample_means, main = paste("Sample Size =", size), xlab = "Sample Mean", col = "lightblue", border = "black")
}

ANSWER HERE BELOW The Central Limit Theorem (CLT) is illustrated graphically by the histograms for sample sizes of 1, 2, 3, 10, 30, and 100. The changing shape of the histograms shows how the distribution of sample means is affected by a growing sample size.

Specifically, for a sample size of 1, the histogram mirrors the original data distribution, which is highly skewed. No averaging takes place, so each sample mean is just a single data point. As the sample size grows to 2 and 3, the sample means cluster closer to the population mean; the histograms are still skewed but begin to show a centralizing trend. When the sample size rises to 10, the skewness is noticeably reduced and the histogram starts to take on the shape of a bell. At a sample size of 30, the distribution of sample means is more symmetric and has a narrower spread. At a sample size of 100, the sample means are tightly concentrated around the population mean and the histogram is close to a normal distribution.

This pattern aligns with the Central Limit Theorem (CLT), which states that as the sample size increases, the sampling distribution of the mean becomes closer to a normal distribution, regardless of the original shape of the population (Mishra, 2023).

This display highlights the practical significance of the CLT. It allows us to treat the sampling distribution of the mean as approximately normal even when the data are not, and averaging reduces the influence of outliers and random fluctuations. In this way, the CLT supports the use of larger samples to obtain trustworthy and widely applicable findings.
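The visual impression from the histograms can be backed by a number. The sketch below computes the skewness of the sample means for each sample size; it is an illustration on a synthetic exponential `population` (an assumption, not the TikTok data), and `skewness()` is a small helper defined here rather than a function from a package.

```r
# Sketch on synthetic data: skewness of the sample means shrinks as n grows
set.seed(1)
population <- rexp(20000, rate = 1 / 350)  # hypothetical right-skewed counts

# Moment-based skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

sample_sizes <- c(1, 2, 3, 10, 30, 100)
num_samples <- 10000

skew_by_size <- sapply(sample_sizes, function(size) {
  # Each column is one sample of the given size, drawn with replacement
  samples <- matrix(sample(population, size * num_samples, replace = TRUE),
                    nrow = size)
  skewness(colMeans(samples))
})

print(data.frame(sample_size = sample_sizes, skewness = round(skew_by_size, 2)))
```

A skewness near zero indicates a symmetric distribution, so a column of values that falls toward zero as the sample size grows is the numerical counterpart of the histograms becoming bell-shaped.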

References: Mishra, M. L. (2023). Central Limit Theorem (CLT) Definition and Examples. Builtin. Retrieved November 30, 2024, from https://builtin.com/data-science/understanding-central-limit-theorem

Question 5: Bivariate Regression. (video_comment_count)

# Use 'video_comment_count' for this analysis
data_subset <- tik_tok_dataset$video_comment_count

# Bivariate Regression using Gapminder

# Life expectancy is the dependent variable and GDP per capita is the independent variable

bivariate_model <- lm(lifeExp ~ gdpPercap, data = gapminder)
print("Bivariate Regression Summary:")
## [1] "Bivariate Regression Summary:"
print(summary(bivariate_model))
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.758   2.176   8.225  18.426 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.396e+01  3.150e-01  171.29   <2e-16 ***
## gdpPercap   7.649e-04  2.579e-05   29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
## F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16
#Make a graph 
# To create a scatterplot with regression line

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
    geom_point(color = "blue") +           # Scatterplot points
    geom_smooth(method = "lm", color = "red") +  # Regression line
    labs(title = "Scatterplot of Life Expectancy vs. GDP per Capita",
         x = "GDP per Capita",
         y = "Life Expectancy") +
    theme_minimal()  # to change the background, this is optional
## `geom_smooth()` using formula = 'y ~ x'

ANSWER HERE BELOW The regression analysis looks at the connection between GDP per capita (income level) and life expectancy. The results suggest a positive relationship: as GDP per capita increases, life expectancy also rises. Specifically, the model estimates that for every additional unit of GDP per capita, life expectancy increases by about 0.000765 years, or roughly 0.76 years for every additional 1,000 units. This relationship is statistically significant (p < 0.001). The residual standard error is 10.49 on 1702 degrees of freedom.

The starting point, or intercept, is 53.96 years, which implies that in a hypothetical scenario where GDP per capita is zero, life expectancy would be 53.96 years. While this scenario is not realistic, it establishes a starting point for the model. The R-squared value of 34.07% indicates that GDP per capita accounts for roughly one-third of the variation in life expectancy, which points to other significant influencing factors.

The regression line confirms the pattern observed in the scatterplot, indicating a positive association between GDP per capita and life expectancy. This supports the common belief that people in wealthier nations usually live longer because of better living conditions, healthcare, and diet. Consequently, higher GDP is consistently linked with better living standards. Governments should therefore prioritize improving access to healthcare and investing in education to support economic development.

However, the model has limitations. It assumes that the relationship is linear and that other conditions, such as independence of the observations and constant variance of the residuals, are met. Importantly, it doesn’t account for other critical factors like education, access to healthcare, and cultural differences, which also affect life expectancy. Some of these factors might be linked to GDP per capita, potentially confounding the results.

To get a fuller picture, adding more variables to the analysis would help better understand the complex factors that influence life expectancy. This single-variable model provides a solid starting point but doesn’t tell the whole story.
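Because the slope is reported in scientific notation, it helps to rescale it before interpreting it. The sketch below only redoes the arithmetic from the coefficient table above; the GDP values in `gdp_values` are arbitrary illustration points, not values taken from the analysis.

```r
# Coefficients copied from the bivariate regression summary above
intercept <- 5.396e+01   # (Intercept): 53.96 years
slope     <- 7.649e-04   # gdpPercap: years of life expectancy per 1 unit

# Rescale the slope to a more readable unit
slope_per_1000 <- slope * 1000

# Fitted life expectancy at a few illustrative GDP per capita levels
gdp_values <- c(1000, 10000, 30000)
predicted_life_exp <- intercept + slope * gdp_values

cat("Change in life expectancy per 1,000 units of GDP per capita:",
    slope_per_1000, "years\n")
print(data.frame(gdpPercap = gdp_values, predicted_lifeExp = predicted_life_exp))
```

On the fitted model itself, the same predictions could be obtained with `predict(bivariate_model, newdata = data.frame(gdpPercap = gdp_values))`.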

Question 6: Multivariate Regression.

# load the car package for avPlots
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
# the multivariate regression model
multivariate_model <- lm(lifeExp ~ gdpPercap + pop, data = gapminder)

# the summary of the model
print("Multivariate Regression Summary:")
## [1] "Multivariate Regression Summary:"
print(summary(multivariate_model))
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.745   2.055   8.212  18.534 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.365e+01  3.225e-01  166.36  < 2e-16 ***
## gdpPercap   7.676e-04  2.568e-05   29.89  < 2e-16 ***
## pop         9.728e-09  2.385e-09    4.08 4.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.44 on 1701 degrees of freedom
## Multiple R-squared:  0.3471, Adjusted R-squared:  0.3463 
## F-statistic: 452.2 on 2 and 1701 DF,  p-value: < 2.2e-16
# make the graph using added variable plots for each predictor in the model
avPlots(multivariate_model, 
        main = "Added Variable Plots for Life Expectancy Model",
        col = "blue",    # this is the color for the points
        pch = 16)        # the point character

ANSWER HERE BELOW The multivariate regression analysis looks at how life expectancy is related to GDP per capita and population. From the results, the intercept is 53.65, which represents the estimated life expectancy when both GDP per capita and population are zero, a value that’s more of a mathematical starting point than a real-world insight. GDP per capita has a positive effect, with life expectancy increasing by 0.0007676 years for every unit rise in GDP per capita (about 0.77 years per 1,000 units), showing a clear link between income and health outcomes. Population also has a positive coefficient (9.728e-09 years per person, or about 0.01 years per additional million people), but it’s much smaller in practical terms, indicating that its impact on life expectancy might be less direct or dependent on other factors.

The model does a decent job of explaining the data, with an R-squared value of 34.71%, meaning it accounts for about a third of the variation in life expectancy. This is only slightly higher than the 34.07% of the bivariate model, so adding population contributes little. Added variable plots give us a deeper look: GDP per capita shows a strong, consistent trend, confirming its importance, while population has a weaker and more scattered effect. Both predictors are statistically significant. The GDP per capita coefficient barely changes from the bivariate model (7.649e-04 versus 7.676e-04), which suggests there is little multicollinearity between the two predictors, so their individual contributions can be interpreted separately.

Wealthier countries with higher GDP per capita have longer life expectancies, likely due to better healthcare and resources. Urbanization, infrastructure, and healthcare systems may be more important than population size, which has a smaller effect.

Nevertheless, the model has drawbacks. It makes the potentially unrealistic assumption that the predictors and life expectancy have a linear relationship. Additionally, the model leaves out environmental factors and education, which may be significant influences. The analysis is helpful, particularly when it comes to GDP per capita, but it could be strengthened by including additional relevant variables to gain a deeper understanding of life expectancy.
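Multicollinearity between two predictors can be checked directly with a variance inflation factor (VIF). The sketch below is a minimal illustration on synthetic predictors: the log-normal `gdp_percap` and `pop` vectors are assumptions standing in for the gapminder columns, so the printed numbers say nothing about the real data.

```r
# Sketch on synthetic predictors (stand-ins for gapminder's gdpPercap and pop)
set.seed(7)
n <- 500
gdp_percap <- rlnorm(n, meanlog = 8, sdlog = 1)     # hypothetical predictor 1
pop        <- rlnorm(n, meanlog = 15, sdlog = 1.5)  # hypothetical predictor 2

# With exactly two predictors, VIF = 1 / (1 - r^2), where r is their correlation
r <- cor(gdp_percap, pop)
vif_two_predictors <- 1 / (1 - r^2)

# Equivalent general definition: regress one predictor on the other(s)
r_squared <- summary(lm(gdp_percap ~ pop))$r.squared
vif_from_regression <- 1 / (1 - r_squared)

cat("Correlation between predictors:", r, "\n")
cat("VIF:", vif_two_predictors, "\n")
```

A VIF close to 1 means the predictors are nearly uncorrelated. For the fitted model, `vif(multivariate_model)` from the car package, which is already loaded for `avPlots()`, reports the same quantity for each predictor.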

Question 7: Compare your analysis to AI generated output. (video_comment_count)

We used the ChatGPT Data Analyst function, which accepts uploaded files and helps analyze and visualize the data.

Plot generated by AI:

ANSWER HERE BELOW To create the AI analysis and visuals, I used the dataset’s full column of video comment counts. This question appears to compare the precision, repeatability, and ease of use of analysis carried out with AI against R. This comparison demonstrates how several tools handle the CLT and associated statistical studies, including Monte Carlo simulations, regression, and confidence intervals.

R’s analysis offers a more accurate representation of the underlying data and the CLT because it allows complete control over variable selection, data transformations, and iterative refinements. Every step of the process is visible, enabling detailed customization and ensuring transparency in calculations. For example, R’s ability to generate histograms of sample means for varying sample sizes demonstrates the progression of the CLT with clarity and precision. AI-generated analyses are helpful but rely heavily on the quality of prompts and the details provided. While AI can efficiently create graphs and analyze data, it lacks the granularity and flexibility inherent to R for specialized tasks.

The visualizations convey the same general insights, such as the approach of the sample means to a normal distribution as sample size increases and the significant relationship between likes and video views in regression analysis. However, AI-generated visualizations may lack contextual annotations and are less tailored to specific datasets. Despite this, the AI-generated outputs were accurate and visually appealing, and I did not notice any significant font or presentation errors.

For decision-making, I would rely on R’s analytics for its precision and adaptability. However, AI-generated analytics provide a quick and efficient way to preview results or validate R’s output, offering an additional layer of confidence. Together, they complement each other, with R focusing on rigour and customization, while AI provides speed and simplicity.

References:

Mishra, M. L. (2023). Central Limit Theorem (CLT) Definition and Examples. Builtin. Retrieved November 30, 2024, from https://builtin.com/data-science/understanding-central-limit-theorem

Moore, D. S., McCabe, G. P., & Craig, B. A. (2017). Introduction to the practice of statistics (9th ed.). W. H. Freeman.

Rice, J. A. (2006). Mathematical statistics and data analysis (3rd ed.). Cengage Learning.