In this lab, you will apply data transformation techniques, including mean-centering, calculating Z-scores, and performing non-linear transformations on various datasets. Please complete the exercises by filling in the code chunks and answering the interpretation questions. Once completed, knit this document to HTML and submit it as instructed.
Dataset: - Simulated data on the number of hours spent studying per week:
Tasks:
1. Calculate the mean of the study hours.
2. Mean-center the dataset by subtracting the mean from each
value.
3. Plot the original and mean-centered study hours on the same
graph.
4. Interpretation: Explain what the mean-centered
values tell you about the amount of time each student spent studying
compared to the average.
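# Define the study hours
# (values inferred from the printed mean and mean-centered output; the original data chunk is not shown)
study_hours <- c(15, 22, 18, 25, 20, 28, 24, 19, 23, 26)
# Calculate the mean of the study hours
mean_study_hours <- mean(study_hours)
mean_study_hours
## [1] 22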
# Mean-center the study hours
mean_centered_study_hours <- study_hours - mean_study_hours
mean_centered_study_hours
## [1] -7 0 -4 3 -2 6 2 -3 1 4
# Plot the original study hours
# use plot()
# use abline(h = mean(study_hours))
plot(study_hours, type = "o", col = "blue", pch = 16, main = "Original Study Hours",
xlab = "Student Index", ylab = "Hours Studied")
abline(h = mean_study_hours, col = "red", lwd = 2, lty = 2) # Add a horizontal line at the mean
# Plot the mean-centered study hours
# use plot()
# use abline(h = 0)
plot(mean_centered_study_hours, type = "o", col = "green", pch = 16, main = "Mean-Centered Study Hours",
xlab = "Student Index", ylab = "Mean-Centered Hours")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Add a horizontal line at 0
Interpretation: The mean-centered values show how far each student’s study time is from the class average of 22 hours. Students with positive values studied more than the average student, students with negative values studied less, and a value of 0 means the student studied exactly the average amount. For example, the first student’s value of -7 means 7 hours less than the average.
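As a quick check of this reading, the sketch below takes the mean-centered values printed earlier and lists which students fall above, below, or exactly at the average. The names `centered`, `above`, `below`, and `at_mean` are illustrative and not part of the lab's own code.

```r
# Mean-centered study hours, copied from the printed output in this section
centered <- c(-7, 0, -4, 3, -2, 6, 2, -3, 1, 4)

# Deviations from the mean sum to zero
sum(centered)

# Student indices above, below, and exactly at the average
above   <- which(centered > 0)
below   <- which(centered < 0)
at_mean <- which(centered == 0)

above
below
at_mean
```

Because the values stay in hours, each one can be read directly as "hours more or less than the average student".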
Dataset: - Simulated data on students’ reaction times (in milliseconds):
Tasks:
1. Calculate the mean and standard deviation of the reaction
times.
2. Compute the Z-scores for each reaction time.
3. Plot the Z-scores on a line graph.
4. Interpretation: Discuss what a Z-score greater than
0 or less than 0 indicates about a reaction time relative to the
average.
# Calculate the mean and standard deviation of the reaction times
reaction_times <- c(350, 420, 310, 390, 370, 450, 380, 340, 400, 360)
# Calculate the mean of the reaction times
mean_reaction_time <- mean(reaction_times)
# Calculate the standard deviation of the reaction times
sd_reaction_time <- sd(reaction_times)
# Display mean and standard deviation
mean_reaction_time
## [1] 377
sd_reaction_time
## [1] 40.56545
# Compute the Z-scores
z_scores_reaction_times <- (reaction_times - mean_reaction_time) / sd_reaction_time
# Display the Z-scores
z_scores_reaction_times
## [1] -0.66559107 1.06001542 -1.65165193 0.32046978 -0.17256065 1.79956105
## [7] 0.07395456 -0.91210629 0.56698499 -0.41907586
# Plot the Z-scores
# abline(h= 0)
plot(z_scores_reaction_times, type = "o", col = "purple", pch = 16,
main = "Z-Scores of Reaction Times", xlab = "Index", ylab = "Z-Score")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Reference line at Z=0
Interpretation: A Z-score greater than 0 shows that the reaction time is above the average (slower than the mean), a Z-score less than 0 shows that it is below the average (faster than the mean), and a Z-score of 0 means the reaction time is exactly equal to the average.
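One way to sanity-check the manual calculation is to compare it with base R's `scale()`, which centers a vector and divides by its sample standard deviation by default. The sketch below is a minimal check, assuming the same `reaction_times` vector as in the lab; `z_manual` and `z_scaled` are illustrative names.

```r
# Reaction times (ms) from the lab
reaction_times <- c(350, 420, 310, 390, 370, 450, 380, 340, 400, 360)

# Manual Z-scores, computed the same way as in the lab
z_manual <- (reaction_times - mean(reaction_times)) / sd(reaction_times)

# scale() returns a one-column matrix; as.numeric() turns it back into a plain vector
z_scaled <- as.numeric(scale(reaction_times))

# The two approaches should agree up to floating-point error
all.equal(z_manual, z_scaled)

# Z-scores have mean 0 and standard deviation 1 (up to rounding)
mean(z_manual)
sd(z_manual)

# Reaction times with z > 0 (slower than average) and z < 0 (faster than average)
reaction_times[z_manual > 0]
reaction_times[z_manual < 0]
```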
Dataset: - Simulated data on annual sales figures (in thousands of dollars):
Tasks:
1. Apply a logarithmic transformation to the sales data.
2. Apply a square root transformation to the sales data.
3. Plot histograms of the original and transformed sales data.
4. Interpretation: Compare the distributions of the
original and transformed data. Explain how each transformation affects
the spread and shape of the data.
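# Define the sales data
# (values inferred from the printed log and square root output; the original data chunk is not shown)
sales <- c(200, 450, 700, 1200, 300, 800, 1100, 900, 400, 1500)
# Apply a logarithmic transformation
log_sales <- log(sales)
# Display transformed values
log_sales
## [1] 5.298317 6.109248 6.551080 7.090077 5.703782 6.684612 7.003065 6.802395
## [9] 5.991465 7.313220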
# Apply a square root transformation
sqrt_sales <- sqrt(sales)
# Display transformed values
sqrt_sales
## [1] 14.14214 21.21320 26.45751 34.64102 17.32051 28.28427 33.16625 30.00000
## [9] 20.00000 38.72983
# Plot histograms of the original and transformed sales data
#use hist()
hist(sales, main="Original Sales Data", xlab="Sales (in thousands)", col="lightblue", breaks=10)
# Plot histogram of log-transformed sales data
hist(log_sales, main="Log-Transformed Sales Data", xlab="Log(Sales)", col="lightgreen", breaks=10)
# Plot histogram of square root-transformed sales data
hist(sqrt_sales, main="Square Root Transformed Sales Data", xlab="Sqrt(Sales)", col="lightcoral", breaks=10)
Interpretation: Both transformations compress the larger values; the logarithm does so more strongly than the square root.
- Logarithmic Transformation: When the data are right-skewed, the logarithm makes the distribution more symmetric by spreading out the smaller values and compressing the larger ones.
- Square Root Transformation: The square root also reduces right skew, but less dramatically, so it suits data that are only moderately skewed.
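The comparison of shapes can also be put in numbers. The sketch below computes a simple moment-based skewness for the original and transformed sales. Two things here are assumptions rather than part of the lab: the `sales` values are inferred by squaring the printed square root output, and `skewness()` is a hypothetical helper written for this example (base R has no built-in skewness function).

```r
# Sales in thousands of dollars.
# ASSUMPTION: inferred by squaring the square root output printed in the lab.
sales <- c(200, 450, 700, 1200, 300, 800, 1100, 900, 400, 1500)

# Hypothetical helper: moment-based sample skewness
# (about 0 = symmetric, > 0 = right-skewed, < 0 = left-skewed)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Skewness before and after each transformation
skew_values <- c(
  original = skewness(sales),
  sqrt     = skewness(sqrt(sales)),
  log      = skewness(log(sales))
)
round(skew_values, 3)
```

A value closer to 0 indicates a more symmetric sample. With only ten observations these estimates are rough, and a strong transformation such as the logarithm can push the skewness past 0 into negative values.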
Dataset: - Simulated data on daily step counts:
Tasks:
1. Mean-center the step counts.
2. Calculate the Z-scores for the step counts.
3. Plot the original, mean-centered, and Z-scores on separate
graphs.
4. Interpretation: Explain how the combination of
mean-centering and Z-scores helps in understanding the step count data
compared to looking at the original data alone.
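The relationship between the two transformations in these tasks can be sketched as follows: a Z-score is the mean-centered value divided by the standard deviation, so centering keeps the original units and spread, while standardizing removes the units. The names `centered`, `z`, `steps_k`, and `z_k` are illustrative only.

```r
# Step counts from the lab
step_counts <- c(8000, 10500, 9200, 11500, 10000, 12500, 11000, 9500, 10200, 12000)

centered <- step_counts - mean(step_counts)  # still measured in steps
z        <- centered / sd(step_counts)       # unit-free, in standard deviations

# Centering shifts the data but leaves the spread unchanged
c(sd_original = sd(step_counts), sd_centered = sd(centered))

# Standardizing rescales the spread to 1 (up to floating-point error)
sd(z)

# Z-scores do not depend on the unit: steps expressed in thousands give the same result
steps_k <- step_counts / 1000
z_k <- (steps_k - mean(steps_k)) / sd(steps_k)
all.equal(z, z_k)
```

This is why Z-scores are convenient for comparing variables measured on different scales, while mean-centered values are easier to read when the original unit matters.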
# Mean-center the step counts
# Define step count data
step_counts <- c(8000, 10500, 9200, 11500, 10000, 12500, 11000, 9500, 10200, 12000)
# Calculate the mean
mean_steps <- mean(step_counts)
# Mean-center the step counts
mean_centered_steps <- step_counts - mean_steps
# Display the mean-centered values
mean_centered_steps
## [1] -2440 60 -1240 1060 -440 2060 560 -940 -240 1560
# Calculate the Z-scores for the step counts
# Calculate standard deviation
sd_steps <- sd(step_counts)
# Compute the Z-scores
z_scores_steps <- (step_counts - mean_steps) / sd_steps
# Display the Z-scores
z_scores_steps
## [1] -1.78888109 0.04398888 -0.90910351 0.77713687 -0.32258511 1.51028486
## [7] 0.41056287 -0.68915911 -0.17595552 1.14371086
# Plot the original
plot(step_counts, type = "o", col = "blue", pch = 16, main = "Original Step Counts",
xlab = "Index", ylab = "Steps")
abline(h = mean_steps, col = "red", lwd = 2, lty = 2) # Mean line
# Plot the mean-centered
plot(mean_centered_steps, type = "o", col = "green", pch = 16, main = "Mean-Centered Step Counts",
xlab = "Index", ylab = "Mean-Centered Steps")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Reference line at 0
# Plot the Z-scores
plot(z_scores_steps, type = "o", col = "purple", pch = 16, main = "Z-Scores of Step Counts",
xlab = "Index", ylab = "Z-Score")
abline(h = 0, col = "red", lwd = 2, lty = 2) # Reference line at 0
Interpretation: Combining mean-centering and Z-scores makes the step count data easier to understand by showing how each value compares to the average. Mean-centering subtracts the average from each value, which shows whether a step count is higher or lower than average and by how many steps. Z-scores express that same distance in units of standard deviation, which shows how unusual each value is relative to the typical spread. Together, they make the data easier to compare and interpret.
Submission Instructions:
Be sure to knit your document to HTML and check that all content displays correctly before submission. Submit the RPubs link to Canvas Assignments.