1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

Read the data from the .csv file

data <- read.csv("C:/Users/HBE/OneDrive - CAST/CUNY/Bridge Workshop/Homework Materials/R/data.csv", header = TRUE)

Use the ‘summary()’ function to get an overview of the dataset

summary(data)

Mean and median of the attribute n_factor

mean_n_factor <- mean(data$n_factor)
median_n_factor <- median(data$n_factor)

cat("Mean of n_factor:", mean_n_factor, "\n")
cat("Median of n_factor:", median_n_factor, "\n")

Mean and median of the attribute n_logical

mean_n_logical <- mean(data$n_logical)
median_n_logical <- median(data$n_logical)

cat("Mean of n_logical:", mean_n_logical, "\n")
cat("Median of n_logical:", median_n_logical, "\n")

Mean and median of the attribute n_numeric

mean_n_numeric <- mean(data$n_numeric)
median_n_numeric <- median(data$n_numeric)

cat("Mean of n_numeric:", mean_n_numeric, "\n")
cat("Median of n_numeric:", median_n_numeric, "\n")
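The three mean/median blocks above repeat the same pattern, so they can also be written once with `sapply()`. The following is a minimal sketch, not the method used in this write-up: the data frame `toy` and its values are made up for illustration, and `na.rm = TRUE` is an assumption about how missing values should be handled.

```r
# Toy stand-in for the homework data; the values are hypothetical.
toy <- data.frame(
  n_factor  = c(0, 2, 12, NA, 15),
  n_logical = c(0, 0, 1, 0, 0),
  n_numeric = c(3, 5, 20, 7, 30)
)

cols <- c("n_factor", "n_logical", "n_numeric")

# One row per statistic, one column per attribute.
# na.rm = TRUE skips missing values instead of returning NA.
stats <- sapply(toy[cols], function(x) {
  c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))
})

print(stats)
```

Without `na.rm = TRUE`, a single missing value in a column makes both `mean()` and `median()` return `NA`.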

2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

Subset columns and rows to create a new data frame

subset_df <- data[data$n_factor > 10, c("Package", "Item", "Title", "n_factor", "n_logical", "n_numeric")]

Rename the new data frame to “filtered_data”

filtered_data <- subset_df
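Bracket indexing with a logical condition behaves differently from `subset()` when the filter column contains missing values. The sketch below uses a made-up data frame `toy` (its values are hypothetical, and whether the real data contains `NA` values is not known from this write-up) to show the difference.

```r
# Hypothetical data with one missing n_factor value.
toy <- data.frame(
  Package   = c("AER", "Ecdat", "medicaldata", "AER"),
  n_factor  = c(12, 3, NA, 15),
  n_numeric = c(20, 5, 7, 30),
  stringsAsFactors = FALSE
)

# Bracket indexing: an NA in the condition produces a row filled with NA.
by_bracket <- toy[toy$n_factor > 10, c("Package", "n_factor")]

# subset(): rows where the condition is NA are dropped.
by_subset <- subset(toy, n_factor > 10, select = c(Package, n_factor))

print(nrow(by_bracket))
print(nrow(by_subset))
```

If the filter column has no missing values, both forms return the same rows.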

3. Create new column names for the new data frame.

New column names for the remaining columns (starting from the fourth column)

new_column_names <- c("n_fac", "n_log", "n_num")

Get the original column names

original_column_names <- names(filtered_data)

Combine the unchanged column names with the new names for the remaining columns

updated_column_names <- c(original_column_names[1:3], new_column_names)

Update the column names of the data frame

names(filtered_data) <- updated_column_names
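The renaming above depends on column position (the first three names are kept, the rest replaced). As an alternative sketch, the same renaming can be keyed on the old names so that it does not depend on column order. The one-row data frame `toy` and the helper vector `rename_map` are hypothetical names introduced here for illustration.

```r
# Hypothetical one-row data frame with the same column names as filtered_data.
toy <- data.frame(
  Package = "AER", Item = "x", Title = "t",
  n_factor = 1, n_logical = 0, n_numeric = 2,
  stringsAsFactors = FALSE
)

# Old name on the left, new name on the right.
rename_map <- c(n_factor = "n_fac", n_logical = "n_log", n_numeric = "n_num")

# Replace only the names that appear in the map; leave the others untouched.
hits <- names(toy) %in% names(rename_map)
names(toy)[hits] <- unname(rename_map[names(toy)[hits]])

print(names(toy))
```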

4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

Create a statistical summary of the data frame

data_summary <- summary(filtered_data)

Display the summary

print(data_summary)

Mean and median of the attribute n_fac

mean_n_fac <- mean(filtered_data$n_fac)
median_n_fac <- median(filtered_data$n_fac)

cat("Mean of n_fac:", mean_n_fac, "\n")
cat("Median of n_fac:", median_n_fac, "\n")

Mean and median of the attribute n_log

mean_n_log <- mean(filtered_data$n_log)
median_n_log <- median(filtered_data$n_log)

cat("Mean of n_log:", mean_n_log, "\n")
cat("Median of n_log:", median_n_log, "\n")

Mean and median of the attribute n_num

mean_n_num <- mean(filtered_data$n_num)
median_n_num <- median(filtered_data$n_num)

cat("Mean of n_num:", mean_n_num, "\n")
cat("Median of n_num:", median_n_num, "\n")

Comparing the two data frames

Extract the relevant attributes from each data frame

n_factor_data <- data$n_factor
n_logical_data <- data$n_logical
n_numeric_data <- data$n_numeric

n_fac_filtered <- filtered_data$n_fac
n_log_filtered <- filtered_data$n_log
n_num_filtered <- filtered_data$n_num

Subsetting ‘data’ with the condition ‘data$n_factor > 10’ keeps only the rows whose ‘n_factor’ value is greater than 10 and drops every other row. The resulting data frame therefore describes a different, smaller set of observations than the original ‘data’ table, and its summary statistics differ accordingly.

n_fac vs. n_factor: ‘filtered_data$n_fac’ contains only the values of ‘n_factor’ that are greater than 10. Removing every value less than or equal to 10 shifts the distribution of ‘n_fac’ upward, so its mean is greater than the mean of ‘n_factor’ in the original ‘data’ table, and its median is at least as large as the original median.

n_num vs. n_numeric: The condition is applied to ‘n_factor’ only, so ‘n_numeric’ is not filtered directly; its values change only because whole rows are dropped. In this data set it appears that the rows with ‘n_factor’ greater than 10 also carry higher ‘n_numeric’ values, so the mean and median of ‘n_num’ in ‘filtered_data’ come out greater than the mean and median of ‘n_numeric’ in the original dataset. This is a property of the data rather than a guaranteed consequence of the filter: had ‘n_numeric’ tended to fall as ‘n_factor’ rises, the same filter would have lowered both statistics.
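A small worked example makes the two comparisons concrete. The data frame `toy` below is made up, and its `n_numeric` column is deliberately constructed to run opposite to `n_factor`, which is an assumption for illustration and not a description of the real data.

```r
# Hypothetical data: n_numeric falls as n_factor rises.
toy <- data.frame(
  n_factor  = c(0, 2, 5, 12, 15, 20),
  n_numeric = c(9, 8, 7, 3, 2, 1)
)

# Same filter as in the write-up, applied to the toy data.
kept <- toy[toy$n_factor > 10, ]

# The filtered column always moves up ...
cat("n_factor mean, all rows: ", mean(toy$n_factor), "\n")
cat("n_factor mean, kept rows:", mean(kept$n_factor), "\n")

# ... but another column can move either way, depending on the data.
cat("n_numeric mean, all rows: ", mean(toy$n_numeric), "\n")
cat("n_numeric mean, kept rows:", mean(kept$n_numeric), "\n")
```

Here the mean of `n_factor` rises after filtering while the mean of `n_numeric` falls, so the direction of change in any column other than the filter column has to be checked against the actual data.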

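We also compare the two versions of each attribute using the Wilcoxon rank-sum test, a non-parametric test for whether two samples differ in location (loosely, in their medians). Because ‘filtered_data’ is a subset of ‘data’, the two samples share rows and are not independent, so the resulting p-values should be read as descriptive rather than as a formal test.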

For n_factor attribute

wilcox_factor <- wilcox.test(n_factor_data, n_fac_filtered)

For n_logical attribute

wilcox_logical <- wilcox.test(n_logical_data, n_log_filtered)

For n_numeric attribute

wilcox_numeric <- wilcox.test(n_numeric_data, n_num_filtered)

Print the results

cat("Wilcoxon rank-sum test results for n_factor:\n")
print(wilcox_factor)

cat("Wilcoxon rank-sum test results for n_logical:\n")
print(wilcox_logical)

cat("Wilcoxon rank-sum test results for n_numeric:\n")
print(wilcox_numeric)
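The rank-sum test assumes two independent samples. One way to meet that assumption, sketched here as an alternative and not as the approach taken in this write-up, is to compare the rows that pass the filter against the rows that do not. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical data: four rows fail the filter, four rows pass it.
toy <- data.frame(
  n_factor  = c(0, 1, 2, 3, 12, 14, 16, 18),
  n_numeric = c(4, 6, 5, 7, 20, 18, 25, 30)
)

keep <- toy$n_factor > 10

# Kept rows vs. excluded rows: the two groups share no observations.
res <- wilcox.test(toy$n_numeric[keep], toy$n_numeric[!keep])

print(res)
```

With groups this small and no ties, `wilcox.test()` computes an exact p-value; on larger data or data with ties it falls back to a normal approximation.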

5. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.

Rename three values in the “Package” column of the data frame “filtered_data”

Rename the values “AER”, “Ecdat”, and “medicaldata” to “REA”, “tadcE”, and “atadlacidem”, respectively.

filtered_data$Package <- ifelse(filtered_data$Package %in% c("AER", "Ecdat", "medicaldata"), c("REA", "tadcE", "atadlacidem")[match(filtered_data$Package, c("AER", "Ecdat", "medicaldata"))], filtered_data$Package)
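The same recoding can be expressed with a named lookup vector, which keeps each old value next to its replacement. This is a minimal sketch on a made-up character vector `packages`; the name `lookup` is hypothetical, and the approach assumes the column is of type character rather than factor.

```r
# Hypothetical column values; "datasets" is not in the lookup and stays as-is.
packages <- c("AER", "Ecdat", "medicaldata", "datasets", "AER")

# Old value on the left, new value on the right.
lookup <- c(AER = "REA", Ecdat = "tadcE", medicaldata = "atadlacidem")

# Replace only the values that have an entry in the lookup.
hit <- packages %in% names(lookup)
packages[hit] <- unname(lookup[packages[hit]])

print(packages)
```

Adding a fourth renamed value then means adding one entry to `lookup` instead of editing two parallel vectors.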

6. Display enough rows to see examples of all of steps 1-5 above.

‘data’ is the name of the dataset

Display the first few rows of the original dataset

print("Step 1: Original dataset")
head(data, 3)

Display the first few rows of the filtered dataset

print("Step 2: 'filtered_data' with n_factor > 10")
head(filtered_data, 3)

print("Step 3.1: Mean and Median of 'n_fac' in 'filtered_data'")
print(mean_n_fac)
print(median_n_fac)

print("Step 3.2: Mean and Median of 'n_factor' in 'data'")
print(mean_n_factor)
print(median_n_factor)

print("Step 4.1: Mean and Median of 'n_num' in 'filtered_data'")
print(mean_n_num)
print(median_n_num)

print("Step 4.2: Mean and Median of 'n_numeric' in 'data'")
print(mean_n_numeric)
print(median_n_numeric)

print("Step 5.1: Mean and Median of 'n_log' in 'filtered_data'")
print(mean_n_log)
print(median_n_log)

print("Step 5.2: Mean and Median of 'n_logical' in 'data'")
print(mean_n_logical)
print(median_n_logical)

7. BONUS – place the original .csv in a github file and have R read from the link.

Install and load the needed package

install.packages("readr")
library(readr)

The default connection buffer is small, so set its size to a larger value (e.g., 1 million bytes)

Sys.setenv("VROOM_CONNECTION_SIZE" = 1e6)

Set url to the raw location of the data.csv file in GitHub (the /blob/ address returns an HTML page, not the file contents)

url <- "https://raw.githubusercontent.com/hbedros/R_HW2/main/data.csv"

Read the .csv file from the GitHub URL

data <- read_csv(url)

Print the data frame to check if it was loaded correctly

print(data)
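GitHub shows a file's web page at addresses containing `/blob/` and serves the file contents themselves from `raw.githubusercontent.com`. The sketch below derives the raw address from a page address with base R's `sub()`; the variable names `blob_url` and `raw_url` are hypothetical, and the final read is left commented out because it needs network access and assumes the repository is public.

```r
# Page address of the file as shown in the browser.
blob_url <- "https://github.com/hbedros/R_HW2/blob/main/data.csv"

# Swap the host and drop the "/blob" path segment to get the raw-file address.
raw_url <- sub(
  "^https://github\\.com/([^/]+)/([^/]+)/blob/",
  "https://raw.githubusercontent.com/\\1/\\2/",
  blob_url
)

print(raw_url)

# Reading the file then works with base R as well as readr:
# data <- read.csv(raw_url, header = TRUE)
```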