- Perform descriptive and predictive analytics techniques.
- Create and save datasets manually.
- Interpret and visualize your data using different plot types.
- Understand the importance of data cleaning and transformation for analysis.
Part 1: Creating dataset & datafile
Instead of loading an existing dataset, we’ll manually create our own simulated data to practice with. This will help you understand the structure of data in R and the importance of data management.
Setup
Let’s start by loading the necessary libraries. If you encounter any errors, revisit our first module for guidance on installing packages.
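The setup chunk itself is not shown on this page, so the following is only a sketch of what it might look like. It assumes the packages needed are ggplot2 (plots), dplyr (`%>%`, `filter()`) and tidyr (`pivot_longer()`), inferred from the functions used later in this module; the names `required_pkgs` and `missing_pkgs` are hypothetical.

```r
# Packages assumed from the functions used later in this module
# (an assumption, since the original setup chunk is not shown).
required_pkgs <- c("ggplot2", "dplyr", "tidyr")

# Check which packages are installed without attaching them yet.
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Nothing is installed automatically; this only reports what is missing.
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(ggplot2)
  library(dplyr)
  library(tidyr)
}
```

If any package is reported as missing, `install.packages()` with the package name installs it.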
Creating Simulated Data
Task 1: We’ll simulate data for 40 students’ time spent on a learning management system (LMS) over 16 weeks. We will then save this new dataset as a CSV file.
The set.seed() function is used here to ensure that the random data we generate is the same every time you run the code. Think of it as “shuffling a deck of cards in the same way” so that we all get the same results.
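As a small illustration of the point above, the sketch below draws five values twice, resetting the seed before each draw. The variable names `first_draw` and `second_draw` are made up for this example.

```r
# Draw five values after setting the seed.
set.seed(42)
first_draw <- sample(6:20, 5, replace = TRUE)

# Reset the seed to the same value and draw again.
set.seed(42)
second_draw <- sample(6:20, 5, replace = TRUE)

# Both draws contain exactly the same values in the same order.
identical(first_draw, second_draw)  # TRUE
```

Without the second `set.seed(42)`, the second draw would continue from wherever the random number generator left off and would usually differ.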
# Create a dataset called `data2` with simulated data for 40 students for this week's exercise.
# The first column, 'Student_ID', generates a unique ID for each student by
# combining the word "Student" with the numbers 1 through 40.
#
# The subsequent columns, 'Week_1' through 'Week_16', represent data for 16 weeks.
# Each week's column contains 40 random values, sampled from the range 6 to 20,
# representing hypothetical time spent (in hours) on the learning management system
# by each student during that week.
set.seed(42) # Setting seed for reproducibility
data2 <- data.frame(
  Student_ID = paste("Student", 1:40, sep = "_"),
  Week_1 = sample(6:20, 40, replace = TRUE),
  Week_2 = sample(6:20, 40, replace = TRUE),
  Week_3 = sample(6:20, 40, replace = TRUE),
  Week_4 = sample(6:20, 40, replace = TRUE),
  Week_5 = sample(6:20, 40, replace = TRUE),
  Week_6 = sample(6:20, 40, replace = TRUE),
  Week_7 = sample(6:20, 40, replace = TRUE),
  Week_8 = sample(6:20, 40, replace = TRUE),
  Week_9 = sample(6:20, 40, replace = TRUE),
  Week_10 = sample(6:20, 40, replace = TRUE),
  Week_11 = sample(6:20, 40, replace = TRUE),
  Week_12 = sample(6:20, 40, replace = TRUE),
  Week_13 = sample(6:20, 40, replace = TRUE),
  Week_14 = sample(6:20, 40, replace = TRUE),
  Week_15 = sample(6:20, 40, replace = TRUE),
  Week_16 = sample(6:20, 40, replace = TRUE)
)

# Save the dataset as a CSV file named 'X40_students_LMS_time_spent.csv'.
# You can name the file differently if you'd like.
# The row.names = FALSE argument prevents R from writing an unnecessary column of row numbers.
write.csv(data2, "X40_students_LMS_time_spent.csv", row.names = FALSE)

# Inspect the first few rows of the dataset
head(data2)
Once your data is created and saved as a CSV file, you should see the new file in your Files Pane.
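Besides looking in the Files Pane, one way to confirm that a save worked is to read the file back and compare it with the original. The sketch below is an illustration, not part of the assignment: it builds a comparable data frame with `replicate()` (a more compact alternative to typing 16 `sample()` calls) and writes it to a temporary file. The names `demo_data`, `csv_path` and `reloaded`, and the file name `lms_demo.csv`, are hypothetical.

```r
set.seed(42)

# replicate() calls sample() 16 times and returns a 40 x 16 matrix,
# one column per week.
weeks <- replicate(16, sample(6:20, 40, replace = TRUE))
colnames(weeks) <- paste0("Week_", 1:16)

demo_data <- data.frame(
  Student_ID = paste("Student", 1:40, sep = "_"),
  weeks
)

# Write to a temporary location so the project folder is left untouched.
csv_path <- file.path(tempdir(), "lms_demo.csv")
write.csv(demo_data, csv_path, row.names = FALSE)

# Read the file back and check that its shape and column names survived.
reloaded <- read.csv(csv_path)
dim(reloaded)
identical(names(reloaded), names(demo_data))
```

If `row.names = FALSE` were omitted, the reloaded data would contain an extra column of row numbers.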
Reflect & Respond
Question: Why do you think it’s important to be able to manually create and save datasets?
[I think it is important to manually create and save datasets because it helps keep your work organized. Datasets you create yourself can be tailored to your project, which keeps the data clear and focused. You also have control over the data and can ensure it meets all the requirements.]
Part 2: Descriptive Analytics & Visualization
Descriptive Analytics
Now that we have our dataset, let’s conduct some basic analytics to better understand the data.
Task 2: Calculate summary statistics for each week. Use the summary() function to analyze the dataset, excluding the first column which is the Student_ID.
# Summary statistics for each week
# summary() calculates summary statistics (minimum, maximum, median, mean, etc.).
# data2[, -1] excludes the first column (-1), which is the Student_ID column.
summary_stats <- summary(data2[, -1])
summary_stats
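To make the indexing concrete, here is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from the module's data). It shows that `[, -1]` drops the first column and that `summary()` then reports statistics per remaining column. It also shows why the object name must be unquoted: a quoted name is just a character string.

```r
# A tiny made-up data frame with the same layout as the module's data.
toy <- data.frame(
  Student_ID = c("Student_1", "Student_2", "Student_3"),
  Week_1 = c(6, 10, 14),
  Week_2 = c(8, 12, 19)
)

# Negative indexing drops the first column (Student_ID).
weekly_only <- toy[, -1]
names(weekly_only)

# summary() on a data frame reports statistics for each column.
toy_summary <- summary(weekly_only)
toy_summary

# A quoted name is a character string, not the data frame itself.
class("toy")
class(toy)
```

Passing a quoted name to `summary()` therefore summarizes a string (length, class, mode) rather than the weekly values.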
Question: What insights do you gain from the summary results?
[I notice the minimum is always six hours. I also noticed the mean is approximately 13 hours over the 16-week period.]
Task 3: Calculate the average time spent per week for all students using the colMeans() function.
# Calculate the average time spent per week
# colMeans() computes the mean of each column in the data2 data frame,
# excluding the first column (-1), which is the Student_ID column.
average_time <- colMeans(data2[, -1])
average_time
[It shows the average time spent per week for all students for week 1 - 16.]
Question: If some weeks show significantly higher or lower time spent, what actions would you take as an instructor or course designer?
[If some weeks show significantly higher or lower time spent, I would look into potential issues with the course material, such as it being easier or harder during that specific week. I would also review the workload during those weeks and check students’ engagement in the course during that time.]
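As a rough sketch of how an instructor might spot the weeks that stand out, the example below computes weekly averages with `colMeans()` on a small made-up data frame and then picks out the highest and lowest week. The data frame `toy` and its values are hypothetical.

```r
# Made-up data: three students, three weeks.
toy <- data.frame(
  Student_ID = c("Student_1", "Student_2", "Student_3"),
  Week_1 = c(6, 10, 14),
  Week_2 = c(8, 12, 19),
  Week_3 = c(18, 20, 19)
)

# Average time per week, excluding the ID column.
weekly_avg <- colMeans(toy[, -1])
weekly_avg

# Names of the weeks with the highest and lowest average.
busiest_week <- names(which.max(weekly_avg))
quietest_week <- names(which.min(weekly_avg))
busiest_week
quietest_week
```

The same two lines with `which.max()` and `which.min()` could be applied to the module's `average_time` vector.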
Task 4: Calculate each individual’s average time spent across 16 weeks. Use the rowMeans() function and add this new variable to the data2 data frame.
# Calculate the mean time spent for each student across all 16 weeks.
# rowMeans() calculates the mean for each row (i.e., each student).
# Columns 2 to 17 hold the weekly time spent values for each student.
# The result is stored in the 'Semester_Average' column.
data2$Semester_Average <- rowMeans(data2[, 2:17])

# Inspect the first few rows to see the new column with mean time spent
head(data2)
Task 5: Calculate the average time spent for each student only from week 1 to week 5. Add this as a new variable named early_semester_average.
# Calculate the mean time spent for each student ONLY from week 1 to week 5
# and save the result in a new column named 'early_semester_average'.
# Columns 2 to 6 correspond to Week_1 through Week_5.
data2$early_semester_average <- rowMeans(data2[, 2:6])

# Inspect the first few rows to see the new column
head(data2)
Question: How do these newly added variables provide you with new insights? What could you do with this information as an instructor?
[This variable provides insight for the first five weeks of the course. This knowledge could help set the tone for the remainder of the semester. It could enable the instructor to pinpoint if certain students were having trouble with the learning management system.]
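Selecting columns by position (such as `2:6`) works as long as the column order never changes. A possible alternative, sketched below on made-up data, is to select the weeks by name so the code keeps working even after new columns are added. The data frame `toy`, its values, and the vector `early_weeks` are hypothetical.

```r
# Made-up data: two students, six weeks.
toy <- data.frame(
  Student_ID = c("Student_1", "Student_2"),
  Week_1 = c(6, 10),
  Week_2 = c(8, 12),
  Week_3 = c(10, 14),
  Week_4 = c(12, 16),
  Week_5 = c(14, 18),
  Week_6 = c(20, 20)
)

# Build the column names "Week_1" ... "Week_5" instead of using positions.
early_weeks <- paste0("Week_", 1:5)

# Average only the selected columns for each student (row).
toy$early_semester_average <- rowMeans(toy[, early_weeks])
toy$early_semester_average
```

Here Week_6 is ignored because it is not listed in `early_weeks`.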
Data Visualization
Visualizing data helps to better understand and communicate your findings. Let’s create some plots with our data.
Bar Plot of Average Time Spent Per Week
For this data, a bar plot is an excellent way to show the average time spent each week.
# First, reshape the average_time vector into a data frame for ggplot
average_time_table <- data.frame(
  Week = factor(names(average_time), levels = names(average_time)),
  Average_Time_Spent = average_time
)

ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Time Spent per Week", y = "Average Time Spent")
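The `levels =` argument in the code above matters more than it may seem. The following sketch shows what happens to the week order with and without it; the variable names are made up for the illustration.

```r
week_names <- paste0("Week_", 1:16)

# Without explicit levels, factor() sorts the labels as text,
# so "Week_10" comes right after "Week_1".
default_order <- levels(factor(week_names))
head(default_order, 3)   # "Week_1" "Week_10" "Week_11"

# With explicit levels, the original (chronological) order is kept.
explicit_order <- levels(factor(week_names, levels = week_names))
head(explicit_order, 3)  # "Week_1" "Week_2" "Week_3"
```

ggplot2 arranges a discrete x-axis by factor levels, so the explicit version keeps the weeks in chronological order on the plot.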
Task 6: Create a bar plot using the average_time_table data frame. Experiment with different colors and text for the parameters (fill, color, size, face, etc.).
# Create a bar plot of 'Average Time Spent' by 'Week'
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent)) +
  # Add the bars to show the average time spent in each week
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  # Add titles and labels to the plot
  labs(title = "Bar Plot of Average Time Spent per Week", x = "Week", y = "Average Time Spent") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1)
  )
Line Plot of Average Time Spent Per Week
Task 7: Line plots are great for showing trends over time. Create a line plot to visualize the trend of average time spent per week using average_time_table.
# Create a line plot
# Note: `linewidth` replaces the `size` aesthetic for lines, which was deprecated in ggplot2 3.4.0.
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent, group = 1)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "red", size = 3) +
  labs(
    title = "Trend of Average Time Spent per Week",
    x = "Week",
    y = "Average Time Spent (Hours)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1)
  )
Question: What differences do you notice between the bar plot and the line plot? Which one is more effective for showing a trend over time, and why?
[Bar plots use solid, discrete bars to compare values across different categories. The bars are typically separated by small gaps to emphasize that each category is distinct. Line plots use connected data points to show how a value changes continuously over time or another ordered variable. The connecting line emphasizes the movement or trend between each point. A line plot is more effective for showing a trend over time.]
Task 7-2: If you are interested in the trends of each student’s time spent from week 1 to week 16, a line plot can be helpful.
# Reshape the data for easier plotting
data_long <- data2 %>%
  pivot_longer(cols = starts_with("Week"), names_to = "Week", values_to = "TimeSpent") %>%
  # Keep the weeks in chronological order (Week_1, Week_2, ...) instead of alphabetical order
  mutate(Week = factor(Week, levels = unique(Week)))

# Create a line plot for each student's weekly TimeSpent
ggplot(data_long, aes(x = Week, y = TimeSpent, group = Student_ID, color = Student_ID)) +
  geom_line() +
  labs(
    title = "Weekly Time Spent by Each Student",
    x = "Week",
    y = "Time Spent (Hours)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none" # Hides the legend to reduce clutter
  )
Question: What do you think about the analytics result?
[I believe the analytical results are very difficult to read. Even though the lines are in different colors, there are so many lines in such a small space.]
Task 7-3: To better interpret the analytics, let’s focus on a subset of students and create a line plot.
# Select 5 specific students
selected_students <- data2 %>%
  filter(Student_ID %in% c("Student_8", "Student_13", "Student_14", "Student_27", "Student_38"))

# Reshape the data for easier plotting
data_long_selected <- selected_students %>%
  pivot_longer(cols = starts_with("Week"), names_to = "Week", values_to = "TimeSpent") %>%
  # Keep the weeks in chronological order (Week_1, Week_2, ...) instead of alphabetical order
  mutate(Week = factor(Week, levels = unique(Week)))

# Create a line plot for the selected students
# Note: `linewidth` replaces the `size` aesthetic for lines, which was deprecated in ggplot2 3.4.0.
ggplot(data_long_selected, aes(x = Week, y = TimeSpent, group = Student_ID, color = Student_ID)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Weekly Time Spent by Selected Students",
    x = "Week",
    y = "Time Spent (Hours)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.title = element_blank(), # Hides the legend title for simplicity
    legend.position = "top" # Positions the legend at the top for better visibility
  )
Question: Pick 3-6 students you are particularly interested in comparing. Update the R code above (Task 7-3 chunk) with your selected students. What insights do you gain from this more focused analysis?
[From this more focused analysis of five selected students, I can much more easily see how much time each student spent on the learning management system each week. It appears that the amount of time spent each week likely depends on the rigor of that week’s curriculum.]
Question: What changes did you make to the visualization? Why?
[The visual is much less cluttered and simpler to understand. Focusing on fewer students makes it easier to interpret.]
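When choosing which students to compare, the subset can be picked by hand or driven by the data. The sketch below shows both approaches in base R on a made-up data frame; `toy`, `chosen` and the averages are hypothetical values, not results from this module.

```r
# Made-up semester averages for six students.
toy <- data.frame(
  Student_ID = paste("Student", 1:6, sep = "_"),
  Semester_Average = c(12.1, 13.4, 11.8, 14.2, 12.9, 13.0),
  stringsAsFactors = FALSE
)

# Option 1: pick students by hand; %in% keeps rows whose ID is in the list.
chosen <- c("Student_2", "Student_5")
picked_by_hand <- toy[toy$Student_ID %in% chosen, ]
picked_by_hand$Student_ID

# Option 2: pick the three students with the highest semester average.
ranked <- toy[order(toy$Semester_Average, decreasing = TRUE), ]
top_three <- head(ranked, 3)
top_three$Student_ID
```

The resulting IDs could then be pasted into the `filter()` call in the Task 7-3 chunk.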
Histogram
Task 8: A histogram is useful for understanding the overall distribution of a single variable. Create a histogram of the Semester_Average variable to see the general pattern and frequency of time spent across all students.
# Histogram of Semester_Average
ggplot(data2, aes(x = Semester_Average)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Mean Time Spent by 40 Students",
    x = "Mean Time Spent (Hours)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold")
  )
Question: Which plot do you find more insightful, the bar plot, box plot, or histogram? Why?
[I find the bar plot to be the most insightful because the height of each bar directly represents the value of its category, making it easy to see at a glance which categories are largest or smallest.]
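To see what a histogram actually counts, the sketch below bins a few made-up averages into one-hour intervals with base R's `hist()` and prints the counts instead of drawing the plot. The values in `semester_avg` and the bin edges are hypothetical and chosen by hand, so they may not match the bins ggplot2 picks for `binwidth = 1`.

```r
# Made-up semester averages (hours).
semester_avg <- c(11.2, 12.5, 12.9, 13.1, 13.4, 14.8)

# One-hour bins from 11 to 15; plot = FALSE returns the counts only.
bin_edges <- 11:15
bins <- hist(semester_avg, breaks = bin_edges, plot = FALSE)

# Number of students falling into each one-hour bin.
bins$counts  # 1 2 2 1
```

Each bar in a histogram is simply one of these counts drawn as a rectangle over its interval.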
Part 3: Predictive Analytics and Visualization
Now, let’s switch gears and explore the relationship between two variables. We will return to the dataset from the previous module, sci-online-classes.csv. We want to see if there is a relationship between the time spent on the LMS (TimeSpent_hours) and students’ final grades (FinalGradeCEMS).
Load data
First, we need to load the data we used in our first module.
# Import/load the dataset
# The template suggests read_csv(); base R's read.csv() is used here and also reads the file.
sci_online_classes <- read.csv("data/sci-online-classes.csv")

# Inspect your data
str(sci_online_classes)

# Display the first few rows of the dataset
head(sci_online_classes)
Visualize Relationships
Task 9: To explore the relationship between two variables (e.g., TimeSpent_hours and FinalGradeCEMS), create a scatter plot with a regression line. This visual will help us see if one variable might predict another.
# Create a scatter plot of TimeSpent_hours vs. FinalGradeCEMS with a regression line
ggplot(data = sci_online_classes, aes(x = TimeSpent_hours, y = FinalGradeCEMS)) +
  geom_point(color = "blue", size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # Adds a linear regression line
  labs(
    title = "Scatter Plot of Time Spent vs. Final Grade with Regression Line",
    x = "Time Spent (Hours)",
    y = "Final Grade (CEMS)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold")
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 30 rows containing missing values or values outside the scale range
(`geom_point()`).
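The red line drawn by `geom_smooth(method = "lm")` is a straight-line fit, and the warnings come from rows where one of the two variables is missing. The sketch below reproduces both ideas with base R on a made-up data frame. The column names mirror the real dataset, but the values (and the perfectly straight relationship) are invented for illustration only.

```r
# Made-up data with two incomplete rows (one NA in each column).
toy <- data.frame(
  TimeSpent_hours = c(5, 10, 15, 20, NA, 30),
  FinalGradeCEMS  = c(60, 65, 70, 75, 80, NA)
)

# Count rows that cannot be plotted or fitted because a value is missing.
n_dropped <- sum(!complete.cases(toy))
n_dropped

# Fit a straight line; lm() silently drops the incomplete rows by default.
fit <- lm(FinalGradeCEMS ~ TimeSpent_hours, data = toy)

# Intercept: predicted grade at zero hours; slope: change in grade per extra hour.
coef(fit)
```

In this toy example the slope is 1, meaning each additional hour goes with one more grade point. Real data will be much noisier.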
Reflect & Respond
Question: Based on the scatter plot, what do you expect the relationship between time spent and final grades to be? Write down your hypothesis.
[If a student increases their time spent using the learning management system, then their final grade will also increase.]
Correlation
Task 10: After visualizing the data, let’s quantify the relationship by computing the correlation between TimeSpent_hours and FinalGradeCEMS. The cor() function is used for this. The argument use = "complete.obs" tells R to ignore any rows with missing data when performing the calculation.
# Compute the correlation
correlation <- cor(
  sci_online_classes$TimeSpent_hours,
  sci_online_classes$FinalGradeCEMS,
  use = "complete.obs"
)

# Display the correlation
correlation
[1] 0.3654121
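To illustrate what `use = "complete.obs"` does, the sketch below computes a correlation on made-up vectors with one missing value. The vectors `time_spent` and `final_grade` are hypothetical and unrelated to the module's dataset.

```r
# Made-up data; the last time value is missing.
time_spent  <- c(5, 10, 15, 20, NA)
final_grade <- c(62, 60, 71, 78, 90)

# With the default settings, a single NA makes the result NA.
cor(time_spent, final_grade)

# With use = "complete.obs", pairs containing an NA are dropped first.
r <- cor(time_spent, final_grade, use = "complete.obs")

# This matches the correlation computed on the four complete pairs by hand.
r_manual <- cor(time_spent[1:4], final_grade[1:4])
all.equal(r, r_manual)
```

A correlation ranges from -1 to 1. A value such as the 0.3654121 reported above is positive but well short of 1, which suggests a moderate rather than a strong association.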
Question: With the scatter plot and correlation results in mind, what insights can you draw about the relationship between time spent and final grades? Remember, this is NOT a traditional statistics course, so focus on interpreting the data in context.
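[From viewing the scatter plot, there appears to be a positive relationship between the time spent using the learning management system and final grades. The correlation of about 0.37 points to a moderate positive association: students who spend more time on the system tend to earn higher final grades, although the correlation alone does not show that more time causes higher grades.]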
Render & Submit
Congratulations, you’ve completed the second module!
To receive a full score, you will need to render this document and publish it via a method such as Quarto Pub, Posit Cloud, RPubs, GitHub Pages, or another method. Once you have shared a link to your published document with me and I have reviewed your work, you will be officially done with the current module.
Complete the following steps to submit your work for review:
First, change the name after author: in the YAML header at the very top of this document to your name. The YAML header controls the look and feel of the rendered document but doesn’t actually display in the final output.
Next, click the “Render” button in the toolbar above to “render” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let me know if you run into any issues with rendering.
Finally, publish. To publish, follow the steps from the link
If you have any questions about this module, or run into any technical issues, don’t hesitate to contact me.
Once I have checked your link, you will be notified!