Database Manipulation in R

1. Libraries

library(UsingR)

## Loading required package: MASS

## Loading required package: HistData

## Loading required package: Hmisc

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(MASS)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

2. Brightness Dataset

a. Histogram and Density Plot of Brightness

brightness_df <- data.frame(brightness)
ggplot(brightness_df, aes(x = brightness)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "lightblue", color = "black") +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Histogram and Density Plot of Brightness", x = "Brightness", y = "Density") +
  coord_cartesian(ylim = c(0, 0.5))

b. Boxplot of brightness

summary(brightness_df)

##    brightness    
##  Min.   : 2.070  
##  1st Qu.: 7.702  
##  Median : 8.500  
##  Mean   : 8.418  
##  3rd Qu.: 9.130  
##  Max.   :12.430

ggplot(brightness_df, aes(y = brightness)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Boxplot of Brightness", y = "Brightness")

# Identify outliers using boxplot stats
outliers <- boxplot.stats(brightness)$out

# Find the second smallest outlier
second_smallest_outlier <- sort(outliers)[2]
#second_smallest_outlier

cat("The second smallest outlier is ", second_smallest_outlier)

## The second smallest outlier is  2.28

According to summary statistics:

The minimum value of the dataset is 2.070.

The first quartile (25th percentile) is 7.702, indicating that 25% of the data points are less than or equal to this value.

The median (50th percentile) is 8.500, which represents the middle value of the dataset when it is sorted in ascending order.

The mean (average) of the dataset is 8.418.

The third quartile (75th percentile) is 9.130, indicating that 75% of the data points are less than or equal to this value.

The maximum value of the dataset is 12.430.

The second smallest outlier in the dataset is 2.28. Here an outlier is a value that boxplot.stats flags because it lies more than 1.5 times the interquartile range beyond the quartiles (the function's default coef = 1.5).
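boxplot.stats flags a value as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies those fences by hand and compares the result with boxplot.stats. The vector x is made up for illustration and is not the brightness data. boxplot.stats computes its quartiles as hinges, which can differ slightly from quantile() for some sample sizes; for this vector the two agree.

```r
# Illustrative values only; not the brightness data
x <- c(2.1, 7.5, 7.9, 8.2, 8.5, 8.8, 9.1, 9.4, 12.9)

q <- quantile(x, probs = c(0.25, 0.75))  # first and third quartiles
fence_width <- 1.5 * (q[2] - q[1])       # 1.5 times the interquartile range
lower_fence <- q[1] - fence_width
upper_fence <- q[2] + fence_width

# Values outside the fences are flagged as outliers
manual_outliers <- x[x < lower_fence | x > upper_fence]
print(manual_outliers)

# The same rule as implemented by boxplot.stats
print(boxplot.stats(x)$out)
```

Both calls should flag the same two values, the smallest and the largest element of x.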

c. Boxplot of Brightness without Outliers

# Define the function to remove outliers
remove_outliers <- function(x) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  print(qnt)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  y <- x[x >= (qnt[1] - H) & x <= (qnt[2] + H)]
  print((qnt[1] - H))
  print((qnt[2] + H))
  return(y)
}

# Create new variable excluding outliers
brightness_without <- remove_outliers(brightness)

##    25%    75% 
## 7.7025 9.1300 
##     25% 
## 5.56125 
##      75% 
## 11.27125

# Confirm the outliers are removed
length(brightness_without)

## [1] 928

summary(brightness_without)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.570   7.750   8.500   8.423   9.090  11.260

brightness_without_df<-data.frame(brightness_without)
ggplot(brightness_without_df, aes(y = brightness_without)) +
  geom_boxplot(fill = "lightblue", color = "black",outlier.shape = NA) +
  labs(title = "Boxplot of Brightness without Outliers", y = "Brightness")

After removing the outliers from the dataset, the summary statistics are as follows:

The minimum value of the dataset is now 5.570, which is higher than the minimum value before removing outliers (2.070). This shows that outliers were removed from the low end of the distribution.

The first quartile (25th percentile) is 7.750, indicating that 25% of the data points are less than or equal to this value.

The median (50th percentile) remains unchanged at 8.500, suggesting that the central tendency of the dataset did not significantly shift after removing outliers.

The mean (average) of the dataset is 8.423, which is slightly higher than the mean before removing outliers (8.418).

The third quartile (75th percentile) is 9.090, indicating that 75% of the data points are less than or equal to this value.

The maximum value of the dataset is now 11.260, which is lower than the maximum value before removing outliers (12.430).

Overall, removing the outliers narrowed the range of the data from [2.070, 12.430] to [5.570, 11.260], so values were trimmed from both tails. The median is unchanged at 8.500 and the mean moved only from 8.418 to 8.423, which indicates that the outliers had little influence on the center of the distribution.

3. UScereal Dataset

Relation between manufacturer and shelf

# Work on a copy of the UScereal data frame
UScereal_df <- UScereal

# Stacked bar plot of cereal counts per manufacturer, split by shelf.
# shelf is stored as a number, so it is converted to a factor for the fill;
# otherwise ggplot drops the fill aesthetic.
ggplot(UScereal_df, aes(x = reorder(mfr, -prop.table(table(mfr))[mfr]), fill = factor(shelf))) +
  geom_bar() +
  labs(title = "Relation between manufacturer and shelf",
       x = "Manufacturer",
       y = "Number of cereals",
       fill = "Shelf") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Relation between fat and vitamins

# Box plot of fat content at each vitamin-enrichment level.
# fat is continuous, so it is mapped to y rather than to a bar fill,
# which ggplot would drop.
ggplot(UScereal_df, aes(x = vitamins, y = fat)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Relation between fat and vitamins",
       x = "Vitamins",
       y = "Fat") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Relation between fat and shelf

# Create a scatter plot to visualize the relationship between fat and shelf
ggplot(UScereal, aes(x = shelf, y = fat)) +
  geom_point() +
  labs(title = "Relationship between fat and shelf",
       x = "Shelf",
       y = "Fat")

Relation between carbohydrates and sugars

# Calculate the correlation between carbohydrates and sugars
carb_sugar_correlation <- cor(UScereal_df$carbo, UScereal_df$sugars, use = "complete.obs")
print(paste("Correlation between carbohydrates and sugars:", carb_sugar_correlation))

## [1] "Correlation between carbohydrates and sugars: -0.0408259889669842"

# Refer to columns by name inside aes(); the data frame is already supplied
ggplot(UScereal_df, aes(x = carbo, y = sugars)) +
  geom_point() +
  labs(title = "Relationship between Carbohydrates and Sugars",
       x = "Carbohydrates",
       y = "Sugars")

Relation between fibre and manufacturer

# Box plot of fibre content for each manufacturer.
# fibre is continuous, so it is mapped to y rather than to a bar fill,
# which ggplot would drop.
ggplot(UScereal_df, aes(x = mfr, y = fibre)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Relation between fibre and manufacturer",
       x = "Manufacturer",
       y = "Fibre") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

e. Relation between sodium and sugars

# Calculate the correlation between sodium and sugars
sodium_sugar_correlation <- cor(UScereal_df$sodium, UScereal_df$sugars, use = "complete.obs")
print(paste("Correlation between sodium and sugars:", sodium_sugar_correlation))

## [1] "Correlation between sodium and sugars: 0.211243651049779"

# Refer to columns by name inside aes(); the data frame is already supplied
ggplot(UScereal_df, aes(x = sodium, y = sugars)) +
  geom_point() +
  labs(title = "Relationship between Sodium and Sugars",
       x = "Sodium",
       y = "Sugars")

4. Mammals Dataset

a. Linear correlation between body weight and brain weight

# Calculate the correlation between body weight and brain weight
body_brain_correlation <- cor(mammals$body, mammals$brain, use = "complete.obs")
print(paste("Correlation between body weight and brain weight:", body_brain_correlation))

## [1] "Correlation between body weight and brain weight: 0.934163842323355"

Correlation Plot

# Refer to columns by name inside aes(); the data frame is already supplied
ggplot(mammals, aes(x = body, y = brain)) +
  geom_point() +
  labs(title = "Relationship between body weight and brain weight",
       x = "Body weight",
       y = "Brain weight")

Log Scale

body_brain_correlation_log <- cor(log(mammals$body), log(mammals$brain), use = "complete.obs")
print(paste("Correlation between log body weight and log brain weight:", body_brain_correlation_log))

## [1] "Correlation between log body weight and log brain weight: 0.95957475837098"

# Refer to columns by name inside aes(); the data frame is already supplied
ggplot(mammals, aes(x = log(body), y = log(brain))) +
  geom_point() + geom_smooth(method = "lm", col = "blue") +
  labs(title = "Relationship between body weight and brain weight in Log Scale",
       x = "log(Body weight)",
       y = "log(Brain weight)")

## `geom_smooth()` using formula = 'y ~ x'

5. Emissions Dataset

Relationship between the variables GDP, perCapita and CO2 of each country

# Scatter plot for GDP vs CO2 emissions
ggplot(emissions, aes(x = GDP, y = CO2)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "GDP vs CO2 Emissions",
       x = "GDP",
       y = "CO2 Emissions")

## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot for perCapita vs CO2 emissions
ggplot(emissions, aes(x = perCapita, y = CO2)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Per Capita vs CO2 Emissions",
       x = "Per Capita",
       y = "CO2 Emissions")

## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot for GDP vs perCapita
ggplot(emissions, aes(x = GDP, y = perCapita)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "GDP vs Per Capita",
       x = "GDP",
       y = "Per Capita")

## `geom_smooth()` using formula = 'y ~ x'

Linear regression model to predict CO2 emissions from the other variables

# Fit a linear regression model
model <- lm(CO2 ~ GDP + perCapita, data = emissions)

# Summary of the model
summary(model)

## 
## Call:
## lm(formula = CO2 ~ GDP + perCapita, data = emissions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1037.3  -167.4    10.8   153.2  1052.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.100e+02  2.044e+02   2.495   0.0202 *  
## GDP          8.406e-04  5.198e-05  16.172 4.68e-14 ***
## perCapita   -3.039e-02  1.155e-02  -2.631   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 382.8 on 23 degrees of freedom
## Multiple R-squared:  0.9253, Adjusted R-squared:  0.9188 
## F-statistic: 142.5 on 2 and 23 DF,  p-value: 1.102e-13

The formula of the linear regression model is:

CO2 = 510 + 0.0008406×GDP − 0.03039×perCapita

Where:

CO2 is the response variable (CO2 emissions).

GDP is the independent variable (Gross Domestic Product).

perCapita is the second independent variable (GDP per capita).

The model indicates that, holding GDP per capita constant, each one-unit increase in Gross Domestic Product (GDP) is associated with an increase of 0.0008406 units in CO2 emissions. Conversely, holding GDP constant, each one-unit increase in GDP per capita is associated with a decrease of 0.03039 units in CO2 emissions. Both coefficients are significant at the 5% level (p = 4.68e-14 and p = 0.0149).

The coefficient of determination (Multiple R-squared) indicates that the model explains approximately 92.53% of the variability in CO2 emissions.
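To make the reading of the coefficients concrete, here is a minimal sketch that evaluates the fitted equation by hand. The coefficients are the rounded values printed by summary(model). The helper predict_co2 and the input values (a GDP of 1,000,000 and a perCapita of 20,000) are hypothetical and chosen only for illustration.

```r
# Coefficients copied (rounded) from the summary(model) output
b0 <- 5.100e+02
b_gdp <- 8.406e-04
b_percapita <- -3.039e-02

# Hypothetical helper: evaluate the fitted equation for given inputs
predict_co2 <- function(gdp, per_capita) {
  b0 + b_gdp * gdp + b_percapita * per_capita
}

# Illustrative inputs: 510 + 840.6 - 607.8 = 742.8
co2_base <- predict_co2(gdp = 1e6, per_capita = 2e4)
print(co2_base)

# Raising GDP by 100,000 with perCapita held fixed adds 0.0008406 * 100000 = 84.06
co2_higher_gdp <- predict_co2(gdp = 1.1e6, per_capita = 2e4)
print(co2_higher_gdp - co2_base)
```

With the fitted object available, predict(model, newdata = data.frame(GDP = 1e6, perCapita = 2e4)) performs the same calculation using the unrounded coefficients.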

Outliers and Linear regression model

# Identify outliers using the residuals of the model
emissions <- emissions %>%
  mutate(residuals = residuals(model))

# Define a threshold for identifying outliers (e.g., 2 standard deviations)
threshold <- 2 * sd(emissions$residuals)

# Filter out the outliers
clean_emissions_df <- emissions %>%
  filter(abs(residuals) < threshold)

# Fit the model again without outliers
clean_model <- lm(CO2 ~ GDP + perCapita, data = clean_emissions_df)

# Summary of the new model
summary(clean_model)

## 
## Call:
## lm(formula = CO2 ~ GDP + perCapita, data = clean_emissions_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -469.70  -50.26   30.88   79.67  348.27 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.917e+02  1.160e+02   1.653   0.1132    
## GDP          8.490e-04  2.774e-05  30.603   <2e-16 ***
## perCapita   -1.307e-02  6.453e-03  -2.026   0.0557 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 196.3 on 21 degrees of freedom
## Multiple R-squared:  0.9811, Adjusted R-squared:  0.9793 
## F-statistic: 544.3 on 2 and 21 DF,  p-value: < 2.2e-16

The formula of the linear regression model without outliers is:

CO2 = 191.7 + 0.000849×GDP − 0.01307×perCapita

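Two observations were dropped as outliers (the residual degrees of freedom fall from 23 to 21). For the refitted model, the coefficient of determination (Multiple R-squared) indicates that it explains approximately 98.11% of the variability in CO2 emissions. The GDP coefficient is almost unchanged, while the perCapita coefficient shrinks and is no longer significant at the 5% level (p = 0.0557).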

6. Anorexia Dataset

What treatment was most effective?

# Calculate weight change
anorexia <- anorexia %>%
  mutate(WeightChange = Postwt - Prewt)

# View the updated data
print(head(anorexia,5))

##   Treat Prewt Postwt WeightChange
## 1  Cont  80.7   80.2         -0.5
## 2  Cont  89.4   80.1         -9.3
## 3  Cont  91.8   86.4         -5.4
## 4  Cont  74.0   86.3         12.3
## 5  Cont  78.1   76.1         -2.0

# Calculate average weight change for each treatment group
avg_weight_change <- anorexia %>%
  group_by(Treat) %>%
  summarise(AverageWeightChange = mean(WeightChange))
print(avg_weight_change)

## # A tibble: 3 × 2
##   Treat AverageWeightChange
##   <fct>               <dbl>
## 1 CBT                 3.01 
## 2 Cont               -0.450
## 3 FT                  7.26

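Family treatment (FT) shows the highest average weight change (7.26 lb), ahead of cognitive behavioural treatment (CBT, 3.01 lb) and the control group (Cont, -0.45 lb), so by this measure FT was the most effective.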

How many patients gained and how many lost weight?

# Count the number of patients who gained and lost weight
weight_gain_loss <- anorexia %>%
  summarise(
    GainedWeight = sum(WeightChange > 0),
    LostWeight = sum(WeightChange < 0),
    NoChange = sum(WeightChange == 0)
  )

# View the count of patients who gained and lost weight
print(weight_gain_loss)

##   GainedWeight LostWeight NoChange
## 1           42         29        1
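The overall counts can also be broken down by treatment group. The following is a sketch in base R, assuming the anorexia data frame from the MASS package; the names weight_change, direction and counts are introduced here for illustration only.

```r
library(MASS)  # provides the anorexia data frame

# Weight change per patient (post-treatment minus pre-treatment weight)
weight_change <- anorexia$Postwt - anorexia$Prewt

# Label each patient by the sign of the change
direction <- factor(sign(weight_change),
                    levels = c(-1, 0, 1),
                    labels = c("Lost", "NoChange", "Gained"))

# Cross-tabulate treatment group against direction of change
counts <- table(Treatment = anorexia$Treat, Direction = direction)
print(counts)
```

The column totals of this table reproduce the overall counts of patients who lost weight, did not change, or gained weight.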

Vectors by rnorm

vector1 <- rnorm(50)
vector2 <- rnorm(50)

A normality test by Shapiro-Wilk

# Perform Shapiro-Wilk Normality Test
shapiro_test_vector1 <- shapiro.test(vector1)
shapiro_test_vector2 <- shapiro.test(vector2)

# Output Results
print("Shapiro-Wilk Test for Vector 1:")

## [1] "Shapiro-Wilk Test for Vector 1:"

print(shapiro_test_vector1)

## 
##  Shapiro-Wilk normality test
## 
## data:  vector1
## W = 0.98467, p-value = 0.7572

print("Shapiro-Wilk Test for Vector 2:")

## [1] "Shapiro-Wilk Test for Vector 2:"

print(shapiro_test_vector2)

## 
##  Shapiro-Wilk normality test
## 
## data:  vector2
## W = 0.97908, p-value = 0.5137

Student's t-test

# Perform t-test
t_test_result <- t.test(vector1, vector2)

print("t-test Result:")

## [1] "t-test Result:"

print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  vector1 and vector2
## t = 0.94841, df = 97.926, p-value = 0.3453
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1722650  0.4876447
## sample estimates:
##   mean of x   mean of y 
## 0.161297429 0.003607581

The t-statistic is small (t = 0.948), reflecting a small difference between the two sample means (0.161 versus 0.004). The p-value of 0.3453 is well above the conventional significance level of 0.05, so the difference is not statistically significant and we fail to reject the null hypothesis that the two population means are equal. The 95% confidence interval for the difference in means (-0.172 to 0.488) contains 0, which supports the same conclusion.
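By default, t.test runs the Welch version of the two-sample test, which does not pool the variances. As a rough sketch of what it computes, the code below rebuilds the statistic, the degrees of freedom and the p-value by hand and compares them with t.test. The seed is arbitrary and the vectors are new draws, not the ones used above.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(50)
y <- rnorm(50)

# Variance of each sample mean (variances are not pooled in the Welch test)
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t-statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test
result <- t.test(x, y)
print(c(manual = t_manual, t_test = unname(result$statistic)))
print(c(manual = p_manual, t_test = result$p.value))
```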

Vectors by rnorm and rbinom

vector1 <- rnorm(50)
vector2 <- rbinom(50, size = 10, prob = 0.5)

A normality test by Shapiro-Wilk

# Perform Shapiro-Wilk Normality Test
shapiro_test_vector1 <- shapiro.test(vector1)
shapiro_test_vector2 <- shapiro.test(vector2)

# Output Results
print("Shapiro-Wilk Test for Vector 1:")

## [1] "Shapiro-Wilk Test for Vector 1:"

print(shapiro_test_vector1)

## 
##  Shapiro-Wilk normality test
## 
## data:  vector1
## W = 0.97568, p-value = 0.3873

print("Shapiro-Wilk Test for Vector 2:")

## [1] "Shapiro-Wilk Test for Vector 2:"

print(shapiro_test_vector2)

## 
##  Shapiro-Wilk normality test
## 
## data:  vector2
## W = 0.96363, p-value = 0.1262

Student's t-test

# Perform t-test
t_test_result <- t.test(vector1, vector2)

print("t-test Result:")

## [1] "t-test Result:"

print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  vector1 and vector2
## t = -17.43, df = 75.939, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.693275 -4.525593
## sample estimates:
##  mean of x  mean of y 
## -0.3094337  4.8000000

100 values from a normal distribution

values <- rnorm(100)

Histogram

ggplot(data.frame(values), aes(x = values)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of 100 Values from Normal Distribution",
       x = "Values",
       y = "Frequency")

Repeat the process a few times. What do you observe?

for (i in 1:4) {
  values <- rnorm(100)

  # Build the histogram. Inside a for loop a ggplot object must be
  # printed explicitly, otherwise nothing is drawn.
  p <- ggplot(data.frame(values), aes(x = values)) +
    geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
    labs(title = paste("Histogram of 100 Values from Normal Distribution, run", i),
         x = "Values",
         y = "Frequency")
  print(p)

  # Print summary statistics
  print(summary(values))
}

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.17133 -0.55049  0.10668  0.06309  0.70933  2.96592 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.4229 -0.9351 -0.3733 -0.2842  0.3064  2.3231 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.1684 -0.7009 -0.1562 -0.0308  0.5589  3.2160 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.30039 -0.65965  0.06507  0.05972  0.68309  2.55879

# Create a list to store generated values and set numbers
values_list <- list()
set_numbers <- c()

# Generate values and store them in the list
for (i in 1:4) {
  values <- rnorm(100)
  values_list[[i]] <- data.frame(values = values)
  set_numbers <- c(set_numbers, rep(i, length(values)))

}

# Combine all data frames into one
values_df <- do.call(rbind, values_list)
values_df$set <- factor(set_numbers)

# Create the plot using ggplot2 with facet_wrap()
ggplot(values_df, aes(x = values)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of 100 Values from Normal Distribution",
       x = "Values",
       y = "Frequency") +
  facet_wrap(~ set, nrow = 2)

The variability among the histograms results from random sampling: each set of 100 values is a different draw from the same normal distribution, so the shapes and patterns differ slightly from run to run even though the underlying distribution is unchanged.
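The size of this run-to-run variation can be quantified. As a small sketch (the seed and the number of repetitions are arbitrary choices), the code below repeats the draw of 100 values many times and looks at how much the sample mean moves. Theory gives a standard error of 1 / sqrt(100) = 0.1 for the mean of 100 standard normal values.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Mean of 100 standard normal draws, repeated 1000 times
sample_means <- replicate(1000, mean(rnorm(100)))

# Spread of the sample means; expected to be close to 0.1
print(sd(sample_means))

# Smallest and largest sample mean observed
print(range(sample_means))
```

This is consistent with the summary statistics shown for the individual runs, where the means fall within a few tenths of 0.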

Data summary

# Data summary
for (i in 1:4) {
  
  cat("Summary Statistics for Set", i, ":\n")
  print(summary(values_list[[i]])) 
}

## Summary Statistics for Set 1 :
##      values        
##  Min.   :-1.76218  
##  1st Qu.:-0.66566  
##  Median :-0.02595  
##  Mean   : 0.01973  
##  3rd Qu.: 0.69894  
##  Max.   : 3.25963  
## Summary Statistics for Set 2 :
##      values        
##  Min.   :-2.13620  
##  1st Qu.:-0.56869  
##  Median :-0.04156  
##  Mean   :-0.03829  
##  3rd Qu.: 0.56558  
##  Max.   : 2.17264  
## Summary Statistics for Set 3 :
##      values        
##  Min.   :-1.99566  
##  1st Qu.:-0.70054  
##  Median :-0.11852  
##  Mean   :-0.05776  
##  3rd Qu.: 0.50487  
##  Max.   : 2.69948  
## Summary Statistics for Set 4 :
##      values       
##  Min.   :-1.9943  
##  1st Qu.:-0.4952  
##  Median : 0.1741  
##  Mean   : 0.1701  
##  3rd Qu.: 0.7191  
##  Max.   : 2.4137

Function pwr.t.test

The pwr.t.test function, from the pwr package, performs power calculations for t-tests in R. Power analysis is an important aspect of experimental design and hypothesis testing, as it allows researchers to determine the sample size needed to detect a specified effect size with a given power and significance level.

a. Example 1

Suppose we want to compare gene expression levels between melanoma and non-melanoma skin cancer samples. We expect an effect size (d) of 0.5 between the two groups and want a power of 80% at a significance level of 5% (alpha = 0.05).

library(pwr)

# Expected effect size (d)
effect_size <- 0.5

# Desired power of the test
power <- 0.8

# Significance level
alpha <- 0.05

# Test type (two-sample t-test)
test_type <- "two.sample"

# Calculate the necessary sample size
pwr.t.test(d = effect_size, power = power, sig.level = alpha, type = test_type)

## 
##      Two-sample t test power calculation 
## 
##               n = 63.76561
##               d = 0.5
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

b. Example 2

Suppose we want to investigate whether a new treatment for a specific type of skin cancer (e.g., melanoma) leads to a significant reduction in tumor size compared to the standard treatment.

We want to calculate the sample size needed to detect a clinically meaningful difference in tumor size between the two treatment groups. We anticipate an effect size of 0.3, a desired power of 0.90, and a significance level of 0.05.

# Expected effect size (d)
effect_size <- 0.3

# Desired power of the test
power <- 0.9

# Significance level
alpha <- 0.05

# Test type (two-sample t-test)
test_type <- "two.sample"

# Calculate the necessary sample size
pwr.t.test(d = effect_size, power = power, sig.level = alpha, type = test_type)

## 
##      Two-sample t test power calculation 
## 
##               n = 234.4627
##               d = 0.3
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
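The results of pwr.t.test can be cross-checked with power.t.test from base R's stats package. This is a sketch, not part of the original analysis. Setting sd = 1 makes delta equal to the standardized effect size d. The two functions use slightly different numerical routines, so the fractional n may differ in the later decimals. Because a sample size must be a whole number, the result is rounded up.

```r
# Example 1: d = 0.5, power = 0.80; Example 2: d = 0.3, power = 0.90.
# With sd = 1, delta is the standardized effect size d.
ex1 <- power.t.test(delta = 0.5, sd = 1, power = 0.80, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")
ex2 <- power.t.test(delta = 0.3, sd = 1, power = 0.90, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")

# Round up to whole participants per group
n1 <- ceiling(ex1$n)
n2 <- ceiling(ex2$n)
print(c(example1 = n1, example2 = n2))
```

Rounded up, this gives 64 samples per group for Example 1 and 235 per group for Example 2, in line with the pwr.t.test outputs (n = 63.77 and n = 234.46).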

Nazly R. Hincapie Monsalve

2024-05-24