Objective:

In this lab, you will apply data transformation techniques, including mean-centering, calculating Z-scores, and performing non-linear transformations on various datasets. Please complete the exercises by filling in the code chunks and answering the interpretation questions. Once completed, knit this document to HTML and submit it as instructed.

Exercise 1: Mean-Centering

Dataset: Simulated data on the number of hours spent studying per week:

Tasks:
1. Calculate the mean of the study hours.
2. Mean-center the dataset by subtracting the mean from each value.
3. Plot the original and mean-centered study hours on the same graph.
4. Interpretation: Explain what the mean-centered values tell you about the amount of time each student spent studying compared to the average.

study_hours <- c(15, 22, 18, 25, 20, 28, 24, 19, 23, 26)
# Calculate the mean of the study hours
mean_study_hours <- mean(study_hours)
mean_study_hours
## [1] 22
# Mean-center the study hours
mean_centered_study_hours <- study_hours - mean_study_hours
mean_centered_study_hours
##  [1] -7  0 -4  3 -2  6  2 -3  1  4
# Plot the original study hours
# use plot() 
# use abline(h = mean(study_hours))
plot(study_hours, type = "o", col = "blue", pch = 16, main = "Original Study Hours", 
     xlab = "Student Index", ylab = "Hours Studied")
abline(h = mean_study_hours, col = "red", lwd = 2, lty = 2) # Add a horizontal line at the mean

# Plot the mean-centered study hours
# use plot() 
# use abline(h = 0)
plot(mean_centered_study_hours, type = "o", col = "green", pch = 16, main = "Mean-Centered Study Hours", 
     xlab = "Student Index", ylab = "Mean-Centered Hours")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Add a horizontal line at 0

Interpretation: Each mean-centered value is a student’s deviation from the class average of 22 hours. Students with positive values studied more than average (for example, 6 corresponds to 28 hours, 6 above the mean), and students with negative values studied less than average (for example, -7 corresponds to 15 hours). A value of 0 means the student studied exactly the average amount.
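
Task 3 asks for both series on the same graph, while the chunks above draw them in separate plots. One possible way to overlay them is sketched below. The names `centered_hours` and `y_range` are illustrative and not part of the lab template, and the shared y-axis range is an assumption made so that both series fit in one panel.

```r
study_hours <- c(15, 22, 18, 25, 20, 28, 24, 19, 23, 26)
centered_hours <- study_hours - mean(study_hours)

# A shared y-axis range so both series fit in one panel
y_range <- range(c(study_hours, centered_hours))

plot(study_hours, type = "o", col = "blue", pch = 16, ylim = y_range,
     main = "Original vs. Mean-Centered Study Hours",
     xlab = "Student Index", ylab = "Hours")
lines(centered_hours, type = "o", col = "green", pch = 16)
abline(h = mean(study_hours), col = "blue", lty = 2)  # mean of the original data
abline(h = 0, col = "green", lty = 2)                 # mean of the centered data
legend("right", legend = c("Original", "Mean-centered"),
       col = c("blue", "green"), pch = 16, lty = 1)

# Centering shifts the values but leaves their spread unchanged
sum(centered_hours)
all.equal(sd(study_hours), sd(centered_hours))
```

The two lines have the same shape. Only the vertical position changes, which is the point of mean-centering.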

Exercise 2: Calculating Z-Scores

Dataset: Simulated data on students’ reaction times (in milliseconds):

Tasks:
1. Calculate the mean and standard deviation of the reaction times.
2. Compute the Z-scores for each reaction time.
3. Plot the Z-scores on a line graph.
4. Interpretation: Discuss what a Z-score greater than 0 or less than 0 indicates about a reaction time relative to the average.

reaction_times <- c(350, 420, 310, 390, 370, 450, 380, 340, 400, 360)
# Calculate the mean and standard deviation of the reaction times
# Calculate the mean of the reaction times
mean_reaction_time <- mean(reaction_times)

# Calculate the standard deviation of the reaction times
sd_reaction_time <- sd(reaction_times)

# Display mean and standard deviation
mean_reaction_time
## [1] 377
sd_reaction_time
## [1] 40.56545
# Compute the Z-scores
z_scores_reaction_times <- (reaction_times - mean_reaction_time) / sd_reaction_time

# Display the Z-scores
z_scores_reaction_times
##  [1] -0.66559107  1.06001542 -1.65165193  0.32046978 -0.17256065  1.79956105
##  [7]  0.07395456 -0.91210629  0.56698499 -0.41907586
# Plot the Z-scores
# abline(h= 0)
plot(z_scores_reaction_times, type = "o", col = "purple", pch = 16, 
     main = "Z-Scores of Reaction Times", xlab = "Index", ylab = "Z-Score")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Reference line at Z=0

Interpretation: A Z-score greater than 0 means the reaction time is above the average (a slower response), a Z-score less than 0 means the reaction time is below the average (a faster response), and a Z-score of 0 means the reaction time is exactly equal to the average. The size of the Z-score gives the distance from the average in standard deviations.
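
As a cross-check on the manual calculation, base R's `scale()` centers a vector and divides by its sample standard deviation by default, so it should reproduce the same Z-scores. The sketch below assumes that default behavior. The names `z_manual` and `z_scaled` are illustrative.

```r
reaction_times <- c(350, 420, 310, 390, 370, 450, 380, 340, 400, 360)

z_manual <- (reaction_times - mean(reaction_times)) / sd(reaction_times)

# scale() returns a one-column matrix with attributes; as.numeric() drops them
z_scaled <- as.numeric(scale(reaction_times))

all.equal(z_manual, z_scaled)
mean(z_scaled)  # zero up to floating-point rounding
sd(z_scaled)    # Z-scores have a standard deviation of 1

# Reaction times that are slower than average (positive Z-score)
reaction_times[z_manual > 0]
```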

Exercise 3: Non-Linear Transformations

Dataset: Simulated data on annual sales figures (in thousands of dollars):

Tasks:
1. Apply a logarithmic transformation to the sales data.
2. Apply a square root transformation to the sales data.
3. Plot histograms of the original and transformed sales data.
4. Interpretation: Compare the distributions of the original and transformed data. Explain how each transformation affects the spread and shape of the data.

sales <- c(200, 450, 700, 1200, 300, 800, 1100, 900, 400, 1500)
# Apply a logarithmic transformation
log_sales <- log(sales)

# Display transformed values
log_sales
##  [1] 5.298317 6.109248 6.551080 7.090077 5.703782 6.684612 7.003065 6.802395
##  [9] 5.991465 7.313220
# Apply a square root transformation
sqrt_sales <- sqrt(sales)

# Display transformed values
sqrt_sales
##  [1] 14.14214 21.21320 26.45751 34.64102 17.32051 28.28427 33.16625 30.00000
##  [9] 20.00000 38.72983
# Plot histograms of the original and transformed sales data
#use hist()
hist(sales, main="Original Sales Data", xlab="Sales (in thousands)", col="lightblue", breaks=10)

# Plot histogram of log-transformed sales data
hist(log_sales, main="Log-Transformed Sales Data", xlab="Log(Sales)", col="lightgreen", breaks=10)

# Plot histogram of square root-transformed sales data
hist(sqrt_sales, main="Square Root Transformed Sales Data", xlab="Sqrt(Sales)", col="lightcoral", breaks=10)

Interpretation: Both transformations compress the larger values, the logarithm more strongly than the square root.
- Logarithmic Transformation: When the data are right-skewed, the logarithm makes the distribution more symmetric by spreading out the smaller values and compressing the larger ones.
- Square Root Transformation: The square root also reduces right skew, but less dramatically than the logarithm, which makes it a milder option when the skew is moderate.
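
Histograms of ten values can be hard to read, so it may help to put a number on the shape. The sketch below uses a moment-based skewness estimate. `sample_skewness` is a hypothetical helper written for this illustration (base R has no built-in skewness function), and the max/min ratio is only a rough stand-in for "spread".

```r
sales <- c(200, 450, 700, 1200, 300, 800, 1100, 900, 400, 1500)

# Hypothetical helper (not part of base R): moment-based sample skewness.
# Positive values indicate a longer right tail, negative a longer left tail.
sample_skewness <- function(x) {
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

skew_summary <- c(
  original = sample_skewness(sales),
  sqrt     = sample_skewness(sqrt(sales)),
  log      = sample_skewness(log(sales))
)
round(skew_summary, 3)

# Ratio of largest to smallest value: a rough view of how much each
# transformation compresses the range
c(original = max(sales) / min(sales),
  sqrt     = max(sqrt(sales)) / min(sqrt(sales)),
  log      = max(log(sales)) / min(log(sales)))
```

With so few observations the estimate is noisy. A log transformation can also push mildly right-skewed data into left skew, so it is worth checking the sign of each result rather than assuming the log version is the most symmetric.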

Exercise 4: Combining Transformations

Dataset: Simulated data on daily step counts:

Tasks:
1. Mean-center the step counts.
2. Calculate the Z-scores for the step counts.
3. Plot the original, mean-centered, and Z-scores on separate graphs.
4. Interpretation: Explain how the combination of mean-centering and Z-scores helps in understanding the step count data compared to looking at the original data alone.

step_counts <- c(8000, 10500, 9200, 11500, 10000, 12500, 11000, 9500, 10200, 12000)
# Mean-center the step counts
# Calculate the mean
mean_steps <- mean(step_counts)

# Mean-center the step counts
mean_centered_steps <- step_counts - mean_steps

# Display the mean-centered values
mean_centered_steps
##  [1] -2440    60 -1240  1060  -440  2060   560  -940  -240  1560
# Calculate the Z-scores for the step counts
# Calculate standard deviation
sd_steps <- sd(step_counts)

# Compute the Z-scores
z_scores_steps <- (step_counts - mean_steps) / sd_steps

# Display the Z-scores
z_scores_steps
##  [1] -1.78888109  0.04398888 -0.90910351  0.77713687 -0.32258511  1.51028486
##  [7]  0.41056287 -0.68915911 -0.17595552  1.14371086
# Plot the original
plot(step_counts, type = "o", col = "blue", pch = 16, main = "Original Step Counts",
     xlab = "Index", ylab = "Steps")
abline(h = mean_steps, col = "red", lwd = 2, lty = 2) # Mean line

# Plot the  mean-centered
plot(mean_centered_steps, type = "o", col = "green", pch = 16, main = "Mean-Centered Step Counts",
     xlab = "Index", ylab = "Mean-Centered Steps")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Reference line at 0

# Plot the Z-scores
plot(z_scores_steps, type = "o", col = "purple", pch = 16, main = "Z-Scores of Step Counts",
     xlab = "Index", ylab = "Z-Score")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Reference line at 0
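
One way to see what Z-scores add on top of mean-centering is to change the unit of measurement. In the sketch below the step counts are re-expressed in thousands of steps. `z_score` and `steps_thousands` are hypothetical names introduced only for this illustration.

```r
step_counts <- c(8000, 10500, 9200, 11500, 10000, 12500, 11000, 9500, 10200, 12000)

# Hypothetical helper for this sketch: Z-scores using the sample standard deviation
z_score <- function(x) (x - mean(x)) / sd(x)

# The same data expressed in thousands of steps (illustrative rescaling)
steps_thousands <- step_counts / 1000

# Mean-centered values depend on the unit of measurement
step_counts[1] - mean(step_counts)
steps_thousands[1] - mean(steps_thousands)

# Z-scores do not: both versions give the same standardized values
all.equal(z_score(step_counts), z_score(steps_thousands))
```

Mean-centered values stay in the original units, which keeps them easy to read as "steps above or below average". Z-scores drop the units, which is what allows comparison across variables measured on different scales.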

Interpretation: The combination of mean-centering and Z-scores makes the step count data easier to understand by showing how each value compares to the average. Mean-centering subtracts the average from each value, which shows whether a step count is higher or lower than average and by how many steps. Z-scores show how far each value is from the average in standard deviations. Used together, they make the data easier to compare and digest.

Submission Instructions:

Be sure to knit your document to HTML and check that all content is displayed correctly before submission. Submit the RPubs link to Canvas Assignments.