#First we import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as npIntro Lab
0. INTRODUCTION
Welcome to your first LASER Learning Lab Case Study! The case study activities included in each learning lab demonstrate how key Learning Analytics (LA) techniques featured in exemplary STEM education research studies can be implemented with Python. Case studies also provide a holistic setting to explore important foundational topics integral to Learning Analytics such as reproducible research, use of APIs, and ethical use of educational data.
In this interactive coding case study, we’ll focus on the following Data Intensive Workflow:
- Reading data in Python (in the Prepare section)
- Preparing and “wrangling” data in a tabular (think spreadsheet!) format (in the Wrangle section)
- Creating some basic plots (in the Explore section)
- Running a model - specifically, a simple regression model (in the Model section)
- Finally, creating a reproducible report of your work you can share with others (in the Communicate section)
(“Learning Analytics Goes to School” [@krumm2018]).
What you are working in now is an Jupiter Notebook file as indicated by the .ipynb file name extension. Here you can read explanation and instructions, and run code in chunks by running each code cell.
Now, let’s get started!
1. PREPARE
First, we need to define and refine a research question and develop an understanding of where your data comes from [@krumm2018]. This part of the process also involves setting up a reproducible research environment [@gandrud2021] so your work can be understood and replicated by other researchers. For now, we’ll focus on just a few parts of this process.
Research Question
In this case study, we’ll be working with data coming from an unpublished research study, which utilized a number of different data sources to understand high school students’ motivation within the context of online courses. For the purpose of this case study, our analysis will be driven by the following research question:
Is there a relationship between the time students spend on a course (as measured through their learning management system) and their final course grade?
Packages 📦
Packages - sometimes referred to as libraries, are shareable collections of code that can contain functions, data, and/or documentation and extend the functionality of the coding language.
We will work with a dataset that was obtained from a learning management system (LMS). We will use libraries pandas (https://pandas.pydata.org/docs/) to read and analyze the data. We will also be using matptlotlib (https://matplotlib.org/stable/users/index.html) package to visualize data distribution, and finally we will use scikit-learn (https://scikit-learn.org/stable/) and numpy (https://numpy.org/devdocs/) to run our regression model.
Click the arrow to execute your code in a cell below.
Loading (or reading in) data
The data we’ll explore in this case study were originally collected for a research study, which utilized a number of different data sources to understand students’ course-related motivation. These courses were designed and taught by instructors through a state-wide online course provider designed to supplement -- but not replace -- students’ enrollment in their local school.
The data used in this case study has already been “wrangled” quite a bit, but the original datasets included:
A self-report survey assessing three aspects of students’ motivation
Log-trace data, such as data output from the learning management system (LMS)
Discussion board data
Academic achievement data
To know more, see Chapter 7 of Data Science in Education Using R [@estrellado2020e].
Next, we’ll load our data - specifically, a CSV (comma separated value) text file, the kind that you can export from Microsoft Excel or Google Sheets - into pandas, using the pd.read_csv() function in the next chunk.
Note: by manipulating data in pandas we do not change the original file.
#read sci-online-classes.csv to sci_data and display the output
sci_data = pd.read_csv("/cloud/project/data/sci-online-classes.csv")
sci_data| student_id | course_id | total_points_possible | total_points_earned | percentage_earned | subject | semester | section | Gradebook_Item | Grade_Category | ... | q7 | q8 | q9 | q10 | TimeSpent | TimeSpent_hours | TimeSpent_std | int | pc | uv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43146 | FrScA-S216-02 | 3280 | 2220 | 0.676829 | FrScA | S216 | 2 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 5.0 | 5.0 | 4.0 | 5.0 | 1555.1667 | 25.919445 | -0.180515 | 5.0 | 4.5 | 4.333333 |
| 1 | 44638 | OcnA-S116-01 | 3531 | 2672 | 0.756726 | OcnA | S116 | 1 | ATTEMPTED | NaN | ... | 4.0 | 5.0 | 4.0 | 4.0 | 1382.7001 | 23.045002 | -0.307803 | 4.2 | 3.5 | 4.000000 |
| 2 | 47448 | FrScA-S216-01 | 2870 | 1897 | 0.660976 | FrScA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 5.0 | 3.0 | 5.0 | 860.4335 | 14.340558 | -0.693260 | 5.0 | 4.0 | 3.666667 |
| 3 | 47979 | OcnA-S216-01 | 4562 | 3090 | 0.677335 | OcnA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 5.0 | 5.0 | 5.0 | 1598.6166 | 26.643610 | -0.148447 | 5.0 | 3.5 | 5.000000 |
| 4 | 48797 | PhysA-S116-01 | 2207 | 1910 | 0.865428 | PhysA | S116 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 4.0 | NaN | 3.0 | 1481.8000 | 24.696667 | -0.234663 | 3.8 | 3.5 | 3.500000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 598 | 97265 | PhysA-S216-01 | 3101 | 2078 | 0.670106 | PhysA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 4.0 | 4.0 | 4.0 | 817.4501 | 13.624168 | -0.724983 | 3.8 | 3.5 | 4.000000 |
| 599 | 97272 | OcnA-S216-01 | 2872 | 1733 | 0.603412 | OcnA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 3.0 | 5.0 | 5.0 | 3.0 | 1638.4500 | 27.307500 | -0.119048 | 4.4 | 3.0 | 5.000000 |
| 600 | 97374 | BioA-S216-01 | 8586 | 6978 | 0.812718 | BioA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | NaN | NaN | NaN | NaN | 470.8000 | 7.846667 | -0.980827 | NaN | NaN | NaN |
| 601 | 97386 | BioA-S216-01 | 2761 | 1937 | 0.701557 | BioA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 3.0 | 4.0 | 3.0 | 3.0 | 71.0166 | 1.183610 | -1.275885 | 3.8 | 3.0 | 3.666667 |
| 602 | 97441 | FrScA-S216-02 | 2607 | 2205 | 0.845800 | FrScA | S216 | 2 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 5.0 | 5.0 | 2.0 | 4.0 | 208.6664 | 3.477773 | -1.174293 | 4.4 | 4.0 | 2.000000 |
603 rows × 30 columns
👉 Your Turn ⤵
❓ What rows and columns correspond with? How many students’ data do we possess? How do you know?
- YOUR RESPONSE HERE
Now, let’s examine your data more systematically.
The first step in getting to know your data is to discover the different data types it contains.
Data Types
There are two types of data:
Categorical that represents categories or groups that are distinct and separable. It usually consists of names, labels, or attributes and is represented by words or symbols.
Numerical that represents qualities that can be measured and represented as numbers.
👉 Your Turn ⤵
❓ Which of the columns in our dataset contain categorical and which contain numerical data?
- YOUR RESPONSE HERE
Another way to explore the data types is by using .info() function.
sci_data.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 student_id 603 non-null int64
1 course_id 603 non-null object
2 total_points_possible 603 non-null int64
3 total_points_earned 603 non-null int64
4 percentage_earned 603 non-null float64
5 subject 603 non-null object
6 semester 603 non-null object
7 section 603 non-null int64
8 Gradebook_Item 603 non-null object
9 Grade_Category 0 non-null float64
10 FinalGradeCEMS 573 non-null float64
11 Points_Possible 603 non-null int64
12 Points_Earned 511 non-null float64
13 Gender 603 non-null object
14 q1 480 non-null float64
15 q2 477 non-null float64
16 q3 480 non-null float64
17 q4 478 non-null float64
18 q5 476 non-null float64
19 q6 476 non-null float64
20 q7 474 non-null float64
21 q8 474 non-null float64
22 q9 474 non-null float64
23 q10 474 non-null float64
24 TimeSpent 598 non-null float64
25 TimeSpent_hours 598 non-null float64
26 TimeSpent_std 598 non-null float64
27 int 527 non-null float64
28 pc 528 non-null float64
29 uv 528 non-null float64
dtypes: float64(20), int64(5), object(5)
memory usage: 141.5+ KB
👉 Your Turn ⤵
❓ Which data types do you see? Which ones are numerical and which are categorical?
- YOUR RESPONSE HERE
If you look at “Grade_category”, you will notice that all values are NaNs or missing values which means we do not have any information about this variable (or parameter).
👉 Your Turn ⤵
❓ What other columns do you think have missing values? Why do you think so?1
- YOUR RESPONSE HERE
2. WRANGLE
By wrangle, we refer to the process of cleaning and processing data, and, in some cases, merging (or joining) data from multiple sources. Often, this part of the process is very (surprisingly) time-intensive! Wrangling your data into shape can itself be an important accomplishment!
Selecting variables
Recall from our Prepare section that we are interested the relationship between the time students spend on a course and their final course grade.
Run the following code chunk using sci_data[[]] and the names of the columns:
FinalGradeCEMS(i.e., students’ final grades on a 0-100 point scale)TimeSpent(i.e., the number of minutes they spent in the course’s learning management system)
sci_data[['FinalGradeCEMS','TimeSpent']]| FinalGradeCEMS | TimeSpent | |
|---|---|---|
| 0 | 93.453725 | 1555.1667 |
| 1 | 81.701843 | 1382.7001 |
| 2 | 88.487585 | 860.4335 |
| 3 | 81.852596 | 1598.6166 |
| 4 | 84.000000 | 1481.8000 |
| ... | ... | ... |
| 598 | 84.569444 | 817.4501 |
| 599 | 84.239532 | 1638.4500 |
| 600 | 12.352941 | 470.8000 |
| 601 | 54.158289 | 71.0166 |
| 602 | 23.137698 | 208.6664 |
603 rows × 2 columns
Cleaning data
We have already seen that there are missing values in our target columns. There is another way to do that by selecting the column and usung .isnull() function and adding .sum() function in the end to find the number of those values.
sci_data.isnull().sum()student_id 0
course_id 0
total_points_possible 0
total_points_earned 0
percentage_earned 0
subject 0
semester 0
section 0
Gradebook_Item 0
Grade_Category 603
FinalGradeCEMS 30
Points_Possible 0
Points_Earned 92
Gender 0
q1 123
q2 126
q3 123
q4 125
q5 127
q6 127
q7 129
q8 129
q9 129
q10 129
TimeSpent 5
TimeSpent_hours 5
TimeSpent_std 5
int 76
pc 75
uv 75
dtype: int64
Handling Missing Values
There are several conventional ways to deal with the missing values.
We can drop those values and not use the entire row in which this element is missing.
We can substitute missing values with the column mean if the variance within a row is not very big.
Below are the code chunks to execute both ways. Choose one you think is more appropriate as executing one excludes the use of the other and you will add changes to the dataset.
Drop missing values
#drop missing values
sci_data = sci_data.dropna(subset = ['FinalGradeCEMS','TimeSpent'])
sci_data.info()<class 'pandas.core.frame.DataFrame'>
Index: 573 entries, 0 to 602
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 student_id 573 non-null int64
1 course_id 573 non-null object
2 total_points_possible 573 non-null int64
3 total_points_earned 573 non-null int64
4 percentage_earned 573 non-null float64
5 subject 573 non-null object
6 semester 573 non-null object
7 section 573 non-null int64
8 Gradebook_Item 573 non-null object
9 Grade_Category 0 non-null float64
10 FinalGradeCEMS 573 non-null float64
11 Points_Possible 573 non-null int64
12 Points_Earned 486 non-null float64
13 Gender 573 non-null object
14 q1 465 non-null float64
15 q2 462 non-null float64
16 q3 465 non-null float64
17 q4 463 non-null float64
18 q5 461 non-null float64
19 q6 461 non-null float64
20 q7 459 non-null float64
21 q8 460 non-null float64
22 q9 459 non-null float64
23 q10 460 non-null float64
24 TimeSpent 573 non-null float64
25 TimeSpent_hours 573 non-null float64
26 TimeSpent_std 573 non-null float64
27 int 503 non-null float64
28 pc 504 non-null float64
29 uv 504 non-null float64
dtypes: float64(20), int64(5), object(5)
memory usage: 138.8+ KB
Substitute with column means
#substitute missing values with column means
mean_value_grade=sci_data['FinalGradeCEMS'].mean()
mean_value_time=sci_data['TimeSpent'].mean()
sci_data['FinalGradeCEMS'].fillna(value=mean_value_grade, inplace=True)
sci_data['TimeSpent'].fillna(value=mean_value_time, inplace=True)/tmp/ipykernel_5274/3740110450.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
sci_data['FinalGradeCEMS'].fillna(value=mean_value_grade, inplace=True)
/tmp/ipykernel_5274/3740110450.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
sci_data['TimeSpent'].fillna(value=mean_value_time, inplace=True)
Filtering variables
Next, let’s explore filtering variables. Check out and run the next chunk of code, imagining that we wish to filter our data to view only the rows associated with students who earned a final grade (as a percentage) of 70% - or higher and the ‘TimeSpent’ associated with it.
sci_data['TimeSpent'][sci_data['FinalGradeCEMS']>70]0 1555.1667
1 1382.7001
2 860.4335
3 1598.6166
4 1481.8000
...
592 244.3835
593 2264.4834
595 2676.7501
598 817.4501
599 1638.4500
Name: TimeSpent, Length: 438, dtype: float64
👉 Your Turn ⤵
❓ How much time do you need to spend to get the grade higher than 70%? Is there a consistent pattern?
- YOUR RESPONSE HERE
3. EXPLORE
Exploratory data analysis, or exploring your data, involves processes of describing your data (such as by calculating the means and standard deviations of numeric variables, or counting the frequency of categorical variables) and, often, visualizing your data. As we’ll learn in later labs, the explore phase can also involve the process of “feature engineering,” or creating new variables within a dataset [@krumm2018].
In this section, we’ll quickly pull together some basic stats and introduce you to a basic data visualization.
Summary Statistics
Let’s repurpose what we learned from our wrangle section to select just a few variables and quickly gather some descriptive statistics to see where the data is centered, its values to identify trends by using .describe() method.
sci_data.describe()| student_id | total_points_possible | total_points_earned | percentage_earned | section | Grade_Category | FinalGradeCEMS | Points_Possible | Points_Earned | q1 | ... | q7 | q8 | q9 | q10 | TimeSpent | TimeSpent_hours | TimeSpent_std | int | pc | uv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 573.000000 | 573.000000 | 573.000000 | 573.000000 | 573.000000 | 0.0 | 573.000000 | 573.000000 | 486.000000 | 465.000000 | ... | 459.000000 | 460.000000 | 459.000000 | 460.000000 | 573.000000 | 573.000000 | 573.000000 | 503.000000 | 504.000000 | 504.000000 |
| mean | 86231.424084 | 4303.863874 | 3267.818499 | 0.758350 | 1.364747 | NaN | 77.202655 | 78.235602 | 69.417905 | 4.290323 | ... | 3.910675 | 4.284783 | 3.492375 | 4.104348 | 1873.849317 | 31.230822 | 0.054687 | 4.215540 | 3.603009 | 3.730974 |
| std | 10392.690030 | 2305.719003 | 1824.242324 | 0.089552 | 0.675195 | NaN | 22.225076 | 169.019143 | 145.754144 | 0.685135 | ... | 0.821833 | 0.688678 | 0.985399 | 0.937004 | 1335.608192 | 22.260137 | 0.985739 | 0.595033 | 0.644908 | 0.703211 |
| min | 43146.000000 | 840.000000 | 672.000000 | 0.466447 | 1.000000 | NaN | 0.000000 | 5.000000 | 0.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.700000 | 0.011667 | -1.327782 | 2.000000 | 1.500000 | 1.000000 |
| 25% | 85644.000000 | 2832.000000 | 2059.000000 | 0.704907 | 1.000000 | NaN | 71.251142 | 10.000000 | 7.600000 | 4.000000 | ... | 3.000000 | 4.000000 | 3.000000 | 4.000000 | 943.066900 | 15.717782 | -0.632272 | 3.900000 | 3.000000 | 3.333333 |
| 50% | 88340.000000 | 3641.000000 | 2777.000000 | 0.776543 | 1.000000 | NaN | 84.569444 | 10.000000 | 10.000000 | 4.000000 | ... | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 1598.616600 | 26.643610 | -0.148447 | 4.200000 | 3.500000 | 3.791667 |
| 75% | 92733.000000 | 5083.000000 | 3882.000000 | 0.826181 | 2.000000 | NaN | 92.099323 | 30.000000 | 26.535000 | 5.000000 | ... | 4.500000 | 5.000000 | 4.000000 | 5.000000 | 2454.333200 | 40.905553 | 0.483111 | 4.700000 | 4.000000 | 4.166667 |
| max | 97441.000000 | 15552.000000 | 12208.000000 | 0.910640 | 4.000000 | NaN | 100.000000 | 935.000000 | 828.200000 | 5.000000 | ... | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 8870.883300 | 147.848055 | 5.218815 | 5.000000 | 5.000000 | 5.000000 |
8 rows × 25 columns
👉 Your Turn ⤵
❓ What do you notice about this dataset? Which columns are most important for our research question? What do you wonder?
- YOUR RESPONSE HERE
Data Visualization
Data visualization is an extremely common practice in Learning Analytics, especially in the use of data dashboards. Data visualization involves graphically representing one or more variables with the goal of discovering patterns in data. These patterns may help us to answer research questions or generate new questions about our data, to discover relationships between and among variables, and to create or select features for data modeling.
The Graphing Workflow
At it’s core, you can create some very simple but attractive graphs with just a couple lines of code. Matplotlib follows the common workflow for making graphs. To make a graph, you simply:
Start the graph with
pltand include type of graph `.hist()’in our case, and add the data as an argument;“Add” elements to the graph using the
bins =, or changing the color;Select variables to graph on each axis with the
xlabel()argument.
Let’s give it a try by creating a simple histogram of our FinalGradeCEMS variable. The code below creates a histogram, or a distribution of the values, in this case for students’ final grades. Go ahead and run it:
plt.hist(sci_data['FinalGradeCEMS'], bins = 30, color ="skyblue")
plt.xlabel('FinalGradeCEMS')
plt.show()👉 Your Turn ⤵
Now use the code chunk below to visualize the distribution of another variable in the data, specifically TimeSpent. You can do so by swapping out the variable FinalGradeCEMS with our new variable. Also, change the color to one of your choosing; consider this list of valid color names here: https://matplotlib.org/stable/gallery/color/named_colors.html
Tip: There is no shame in copying and pasting code from above. Remember, reproducible research is also intended to help you save time!
#Your code goes here:Scatterplots
Finally, let’s create a scatter plot for the relationship between these two variables. Scatterplots are most useful for displaying the relationship between two continuous variables. Change type of graph by typing .scatter(). You can also choose the size of the marker and color.
👉 Your Turn ⤵
Complete the code chunk below to create a simple scatterplot with TimeSpent on the x axis and FinalGradeCEMS on the y axis.
plt.scatter(x = sci_data['TimeSpent'], y = sci_data['FinalGradeCEMS'], marker=".", color = '#88c999')
plt.xlabel('TimeSpent')
plt.ylabel('FinalGradeCEMS')
plt.show()Well done! As you can see, there appears to be a positive relationship between the time students spend in the online course and their final grade!
4. MODEL
“Model” is one of those terms that has many different meanings. For our purpose, we refer to the process of simplifying and summarizing our data. Thus, models can take many forms; calculating means represents a legitimate form of modeling data, as does estimating more complex models, including linear regressions, and models and algorithms associated with machine learning tasks. For now, we’ll run a base linear regression model to further examine the relationship between TimeSpent and FinalGradeCEMS.
We’ll dive much deeper into modeling in subsequent learning labs, but for now let’s see if there is a statistically significant relationship between students’ final grades, FinaGradeCEMS, and the TimeSpent on the course.
We need to do some transformations of our two parameters before we can run the model, in our case it is linear regression. We encode the data from the two columns FinaGradeCEMS and TimeSpent into the format our scikit-learn linear regression model could understand.
*Note that on the y-axis we put a dependent variable, the one we want to predict and x-axis is for and independent, or exploratory variable.
We then train (or fit) or model to the data we selected to find a line of best fit.
We also added the code to visualize our prediction in the scatter plot and changed its color to understand better what is being predicted.
#dependent variable is what you want to predict - y axis; independent(exploratory) variable
X = np.array(sci_data['TimeSpent']).reshape(-1, 1)
y = np.array(sci_data['FinalGradeCEMS']).reshape(-1, 1)
reg = LinearRegression().fit(X, y)
plt.scatter(X, y, marker=".", color = '#88c999')
plt.plot(X, reg.predict(X),color='hotpink')
plt.xlabel('TimeSpent')
plt.ylabel('FinalGradeCEMS')
plt.show()We can now use our model to predict the grade depending on the time spent studying.
print(reg.predict([[4000]]))[[90.13094602]]
#| standalone: true
from shiny import *
app_ui = ui.page_fluid(
ui.input_slider("n", "N", 0, 100, 40),
ui.output_text_verbatim("txt"),
)
def server(input, output, session):
@output
@render.text
def txt():
return f"The value of n*2 is {input.n() * 2}"
app = App(app_ui, server)
Footnotes
We can see that there are totally 603 rows in each column and if we see that non-null count is lower, it means there are missing values.↩︎