Intro Lab

Author

Daria Smyslova

0. INTRODUCTION

Welcome to your first LASER Learning Lab Case Study! The case study activities included in each learning lab demonstrate how key Learning Analytics (LA) techniques featured in exemplary STEM education research studies can be implemented with Python. Case studies also provide a holistic setting to explore important foundational topics integral to Learning Analytics such as reproducible research, use of APIs, and ethical use of educational data.

In this interactive coding case study, we’ll focus on the following Data Intensive Workflow:

Reading data in Python (in the Prepare section)
Preparing and “wrangling” data in a tabular (think spreadsheet!) format (in the Wrangle section)
Creating some basic plots (in the Explore section)
Running a model - specifically, a simple regression model (in the Model section)
Finally, creating a reproducible report of your work you can share with others (in the Communicate section)

(“Learning Analytics Goes to School” [@krumm2018]).

What you are working in now is an Jupiter Notebook file as indicated by the .ipynb file name extension. Here you can read explanation and instructions, and run code in chunks by running each code cell.

Now, let’s get started!

1. PREPARE

First, we need to define and refine a research question and develop an understanding of where your data comes from [@krumm2018]. This part of the process also involves setting up a reproducible research environment [@gandrud2021] so your work can be understood and replicated by other researchers. For now, we’ll focus on just a few parts of this process.

Research Question

In this case study, we’ll be working with data coming from an unpublished research study, which utilized a number of different data sources to understand high school students’ motivation within the context of online courses. For the purpose of this case study, our analysis will be driven by the following research question:

Is there a relationship between the time students spend on a course (as measured through their learning management system) and their final course grade?

Packages 📦

Packages - sometimes referred to as libraries, are shareable collections of code that can contain functions, data, and/or documentation and extend the functionality of the coding language.

We will work with a dataset that was obtained from a learning management system (LMS). We will use libraries pandas (https://pandas.pydata.org/docs/) to read and analyze the data. We will also be using matptlotlib (https://matplotlib.org/stable/users/index.html) package to visualize data distribution, and finally we will use scikit-learn (https://scikit-learn.org/stable/) and numpy (https://numpy.org/devdocs/) to run our regression model.

Click the arrow to execute your code in a cell below.

#First we import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np

Loading (or reading in) data

The data we’ll explore in this case study were originally collected for a research study, which utilized a number of different data sources to understand students’ course-related motivation. These courses were designed and taught by instructors through a state-wide online course provider designed to supplement -- but not replace -- students’ enrollment in their local school.

The data used in this case study has already been “wrangled” quite a bit, but the original datasets included:

A self-report survey assessing three aspects of students’ motivation
Log-trace data, such as data output from the learning management system (LMS)
Discussion board data
Academic achievement data

To know more, see Chapter 7 of Data Science in Education Using R [@estrellado2020e].

Next, we’ll load our data - specifically, a CSV (comma separated value) text file, the kind that you can export from Microsoft Excel or Google Sheets - into pandas, using the pd.read_csv() function in the next chunk.

Note: by manipulating data in pandas we do not change the original file.

#read sci-online-classes.csv to sci_data and display the output
sci_data = pd.read_csv("/cloud/project/data/sci-online-classes.csv")
sci_data

	student_id	course_id	total_points_possible	total_points_earned	percentage_earned	subject	semester	section	Gradebook_Item	Grade_Category	...	q7	q8	q9	q10	TimeSpent	TimeSpent_hours	TimeSpent_std	int	pc	uv
0	43146	FrScA-S216-02	3280	2220	0.676829	FrScA	S216	2	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	5.0	5.0	4.0	5.0	1555.1667	25.919445	-0.180515	5.0	4.5	4.333333
1	44638	OcnA-S116-01	3531	2672	0.756726	OcnA	S116	1	ATTEMPTED	NaN	...	4.0	5.0	4.0	4.0	1382.7001	23.045002	-0.307803	4.2	3.5	4.000000
2	47448	FrScA-S216-01	2870	1897	0.660976	FrScA	S216	1	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	4.0	5.0	3.0	5.0	860.4335	14.340558	-0.693260	5.0	4.0	3.666667
3	47979	OcnA-S216-01	4562	3090	0.677335	OcnA	S216	1	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	4.0	5.0	5.0	5.0	1598.6166	26.643610	-0.148447	5.0	3.5	5.000000
4	48797	PhysA-S116-01	2207	1910	0.865428	PhysA	S116	1	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	4.0	4.0	NaN	3.0	1481.8000	24.696667	-0.234663	3.8	3.5	3.500000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
598	97265	PhysA-S216-01	3101	2078	0.670106	PhysA	S216	1	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	4.0	4.0	4.0	4.0	817.4501	13.624168	-0.724983	3.8	3.5	4.000000
599	97272	OcnA-S216-01	2872	1733	0.603412	OcnA	S216	1	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	3.0	5.0	5.0	3.0	1638.4500	27.307500	-0.119048	4.4	3.0	5.000000
600	97374	BioA-S216-01	8586	6978	0.812718	BioA	S216	1	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	NaN	NaN	NaN	NaN	470.8000	7.846667	-0.980827	NaN	NaN	NaN
601	97386	BioA-S216-01	2761	1937	0.701557	BioA	S216	1	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	3.0	4.0	3.0	3.0	71.0166	1.183610	-1.275885	3.8	3.0	3.666667
602	97441	FrScA-S216-02	2607	2205	0.845800	FrScA	S216	2	POINTS EARNED & TOTAL COURSE POINTS	NaN	...	5.0	5.0	2.0	4.0	208.6664	3.477773	-1.174293	4.4	4.0	2.000000

603 rows × 30 columns

👉 Your Turn ⤵

❓ What rows and columns correspond with? How many students’ data do we possess? How do you know?

YOUR RESPONSE HERE

Now, let’s examine your data more systematically.

The first step in getting to know your data is to discover the different data types it contains.

Data Types

There are two types of data:

Categorical that represents categories or groups that are distinct and separable. It usually consists of names, labels, or attributes and is represented by words or symbols.
Numerical that represents qualities that can be measured and represented as numbers.

👉 Your Turn ⤵

❓ Which of the columns in our dataset contain categorical and which contain numerical data?

YOUR RESPONSE HERE

Another way to explore the data types is by using .info() function.

sci_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 30 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   student_id             603 non-null    int64  
 1   course_id              603 non-null    object 
 2   total_points_possible  603 non-null    int64  
 3   total_points_earned    603 non-null    int64  
 4   percentage_earned      603 non-null    float64
 5   subject                603 non-null    object 
 6   semester               603 non-null    object 
 7   section                603 non-null    int64  
 8   Gradebook_Item         603 non-null    object 
 9   Grade_Category         0 non-null      float64
 10  FinalGradeCEMS         573 non-null    float64
 11  Points_Possible        603 non-null    int64  
 12  Points_Earned          511 non-null    float64
 13  Gender                 603 non-null    object 
 14  q1                     480 non-null    float64
 15  q2                     477 non-null    float64
 16  q3                     480 non-null    float64
 17  q4                     478 non-null    float64
 18  q5                     476 non-null    float64
 19  q6                     476 non-null    float64
 20  q7                     474 non-null    float64
 21  q8                     474 non-null    float64
 22  q9                     474 non-null    float64
 23  q10                    474 non-null    float64
 24  TimeSpent              598 non-null    float64
 25  TimeSpent_hours        598 non-null    float64
 26  TimeSpent_std          598 non-null    float64
 27  int                    527 non-null    float64
 28  pc                     528 non-null    float64
 29  uv                     528 non-null    float64
dtypes: float64(20), int64(5), object(5)
memory usage: 141.5+ KB

👉 Your Turn ⤵

❓ Which data types do you see? Which ones are numerical and which are categorical?

YOUR RESPONSE HERE

If you look at “Grade_category”, you will notice that all values are NaNs or missing values which means we do not have any information about this variable (or parameter).

👉 Your Turn ⤵

❓ What other columns do you think have missing values? Why do you think so?¹

YOUR RESPONSE HERE

2. WRANGLE

By wrangle, we refer to the process of cleaning and processing data, and, in some cases, merging (or joining) data from multiple sources. Often, this part of the process is very (surprisingly) time-intensive! Wrangling your data into shape can itself be an important accomplishment!

Selecting variables

Recall from our Prepare section that we are interested the relationship between the time students spend on a course and their final course grade.

Run the following code chunk using sci_data[[]] and the names of the columns:

FinalGradeCEMS (i.e., students’ final grades on a 0-100 point scale)
TimeSpent (i.e., the number of minutes they spent in the course’s learning management system)

sci_data[['FinalGradeCEMS','TimeSpent']]

	FinalGradeCEMS	TimeSpent
0	93.453725	1555.1667
1	81.701843	1382.7001
2	88.487585	860.4335
3	81.852596	1598.6166
4	84.000000	1481.8000
...	...	...
598	84.569444	817.4501
599	84.239532	1638.4500
600	12.352941	470.8000
601	54.158289	71.0166
602	23.137698	208.6664

603 rows × 2 columns

Cleaning data

We have already seen that there are missing values in our target columns. There is another way to do that by selecting the column and usung .isnull() function and adding .sum() function in the end to find the number of those values.

sci_data.isnull().sum()

student_id                 0
course_id                  0
total_points_possible      0
total_points_earned        0
percentage_earned          0
subject                    0
semester                   0
section                    0
Gradebook_Item             0
Grade_Category           603
FinalGradeCEMS            30
Points_Possible            0
Points_Earned             92
Gender                     0
q1                       123
q2                       126
q3                       123
q4                       125
q5                       127
q6                       127
q7                       129
q8                       129
q9                       129
q10                      129
TimeSpent                  5
TimeSpent_hours            5
TimeSpent_std              5
int                       76
pc                        75
uv                        75
dtype: int64

Handling Missing Values

There are several conventional ways to deal with the missing values.

We can drop those values and not use the entire row in which this element is missing.
We can substitute missing values with the column mean if the variance within a row is not very big.

Below are the code chunks to execute both ways. Choose one you think is more appropriate as executing one excludes the use of the other and you will add changes to the dataset.

Drop missing values

#drop missing values

sci_data = sci_data.dropna(subset = ['FinalGradeCEMS','TimeSpent'])
sci_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 573 entries, 0 to 602
Data columns (total 30 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   student_id             573 non-null    int64  
 1   course_id              573 non-null    object 
 2   total_points_possible  573 non-null    int64  
 3   total_points_earned    573 non-null    int64  
 4   percentage_earned      573 non-null    float64
 5   subject                573 non-null    object 
 6   semester               573 non-null    object 
 7   section                573 non-null    int64  
 8   Gradebook_Item         573 non-null    object 
 9   Grade_Category         0 non-null      float64
 10  FinalGradeCEMS         573 non-null    float64
 11  Points_Possible        573 non-null    int64  
 12  Points_Earned          486 non-null    float64
 13  Gender                 573 non-null    object 
 14  q1                     465 non-null    float64
 15  q2                     462 non-null    float64
 16  q3                     465 non-null    float64
 17  q4                     463 non-null    float64
 18  q5                     461 non-null    float64
 19  q6                     461 non-null    float64
 20  q7                     459 non-null    float64
 21  q8                     460 non-null    float64
 22  q9                     459 non-null    float64
 23  q10                    460 non-null    float64
 24  TimeSpent              573 non-null    float64
 25  TimeSpent_hours        573 non-null    float64
 26  TimeSpent_std          573 non-null    float64
 27  int                    503 non-null    float64
 28  pc                     504 non-null    float64
 29  uv                     504 non-null    float64
dtypes: float64(20), int64(5), object(5)
memory usage: 138.8+ KB

Substitute with column means

#substitute missing values with column means
mean_value_grade=sci_data['FinalGradeCEMS'].mean()
mean_value_time=sci_data['TimeSpent'].mean()

sci_data['FinalGradeCEMS'].fillna(value=mean_value_grade, inplace=True)
sci_data['TimeSpent'].fillna(value=mean_value_time, inplace=True)

/tmp/ipykernel_5274/3740110450.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sci_data['FinalGradeCEMS'].fillna(value=mean_value_grade, inplace=True)
/tmp/ipykernel_5274/3740110450.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sci_data['TimeSpent'].fillna(value=mean_value_time, inplace=True)

Filtering variables

Next, let’s explore filtering variables. Check out and run the next chunk of code, imagining that we wish to filter our data to view only the rows associated with students who earned a final grade (as a percentage) of 70% - or higher and the ‘TimeSpent’ associated with it.

sci_data['TimeSpent'][sci_data['FinalGradeCEMS']>70]

0      1555.1667
1      1382.7001
2       860.4335
3      1598.6166
4      1481.8000
         ...    
592     244.3835
593    2264.4834
595    2676.7501
598     817.4501
599    1638.4500
Name: TimeSpent, Length: 438, dtype: float64

👉 Your Turn ⤵

❓ How much time do you need to spend to get the grade higher than 70%? Is there a consistent pattern?

YOUR RESPONSE HERE

3. EXPLORE

Exploratory data analysis, or exploring your data, involves processes of describing your data (such as by calculating the means and standard deviations of numeric variables, or counting the frequency of categorical variables) and, often, visualizing your data. As we’ll learn in later labs, the explore phase can also involve the process of “feature engineering,” or creating new variables within a dataset [@krumm2018].

In this section, we’ll quickly pull together some basic stats and introduce you to a basic data visualization.

Summary Statistics

Let’s repurpose what we learned from our wrangle section to select just a few variables and quickly gather some descriptive statistics to see where the data is centered, its values to identify trends by using .describe() method.

sci_data.describe()

	student_id	total_points_possible	total_points_earned	percentage_earned	section	Grade_Category	FinalGradeCEMS	Points_Possible	Points_Earned	q1	...	q7	q8	q9	q10	TimeSpent	TimeSpent_hours	TimeSpent_std	int	pc	uv
count	573.000000	573.000000	573.000000	573.000000	573.000000	0.0	573.000000	573.000000	486.000000	465.000000	...	459.000000	460.000000	459.000000	460.000000	573.000000	573.000000	573.000000	503.000000	504.000000	504.000000
mean	86231.424084	4303.863874	3267.818499	0.758350	1.364747	NaN	77.202655	78.235602	69.417905	4.290323	...	3.910675	4.284783	3.492375	4.104348	1873.849317	31.230822	0.054687	4.215540	3.603009	3.730974
std	10392.690030	2305.719003	1824.242324	0.089552	0.675195	NaN	22.225076	169.019143	145.754144	0.685135	...	0.821833	0.688678	0.985399	0.937004	1335.608192	22.260137	0.985739	0.595033	0.644908	0.703211
min	43146.000000	840.000000	672.000000	0.466447	1.000000	NaN	0.000000	5.000000	0.000000	1.000000	...	1.000000	1.000000	1.000000	1.000000	0.700000	0.011667	-1.327782	2.000000	1.500000	1.000000
25%	85644.000000	2832.000000	2059.000000	0.704907	1.000000	NaN	71.251142	10.000000	7.600000	4.000000	...	3.000000	4.000000	3.000000	4.000000	943.066900	15.717782	-0.632272	3.900000	3.000000	3.333333
50%	88340.000000	3641.000000	2777.000000	0.776543	1.000000	NaN	84.569444	10.000000	10.000000	4.000000	...	4.000000	4.000000	4.000000	4.000000	1598.616600	26.643610	-0.148447	4.200000	3.500000	3.791667
75%	92733.000000	5083.000000	3882.000000	0.826181	2.000000	NaN	92.099323	30.000000	26.535000	5.000000	...	4.500000	5.000000	4.000000	5.000000	2454.333200	40.905553	0.483111	4.700000	4.000000	4.166667
max	97441.000000	15552.000000	12208.000000	0.910640	4.000000	NaN	100.000000	935.000000	828.200000	5.000000	...	5.000000	5.000000	5.000000	5.000000	8870.883300	147.848055	5.218815	5.000000	5.000000	5.000000

8 rows × 25 columns

👉 Your Turn ⤵

❓ What do you notice about this dataset? Which columns are most important for our research question? What do you wonder?

YOUR RESPONSE HERE

Data Visualization

Data visualization is an extremely common practice in Learning Analytics, especially in the use of data dashboards. Data visualization involves graphically representing one or more variables with the goal of discovering patterns in data. These patterns may help us to answer research questions or generate new questions about our data, to discover relationships between and among variables, and to create or select features for data modeling.

The Graphing Workflow

At it’s core, you can create some very simple but attractive graphs with just a couple lines of code. Matplotlib follows the common workflow for making graphs. To make a graph, you simply:

Start the graph with plt and include type of graph `.hist()’in our case, and add the data as an argument;
“Add” elements to the graph using the bins =, or changing the color;
Select variables to graph on each axis with the xlabel() argument.

Let’s give it a try by creating a simple histogram of our FinalGradeCEMS variable. The code below creates a histogram, or a distribution of the values, in this case for students’ final grades. Go ahead and run it:

plt.hist(sci_data['FinalGradeCEMS'], bins = 30, color ="skyblue")
plt.xlabel('FinalGradeCEMS')
plt.show()

👉 Your Turn ⤵

Now use the code chunk below to visualize the distribution of another variable in the data, specifically TimeSpent. You can do so by swapping out the variable FinalGradeCEMS with our new variable. Also, change the color to one of your choosing; consider this list of valid color names here: https://matplotlib.org/stable/gallery/color/named_colors.html

Tip: There is no shame in copying and pasting code from above. Remember, reproducible research is also intended to help you save time!

#Your code goes here:

Scatterplots

Finally, let’s create a scatter plot for the relationship between these two variables. Scatterplots are most useful for displaying the relationship between two continuous variables. Change type of graph by typing .scatter(). You can also choose the size of the marker and color.

👉 Your Turn ⤵

Complete the code chunk below to create a simple scatterplot with TimeSpent on the x axis and FinalGradeCEMS on the y axis.

plt.scatter(x = sci_data['TimeSpent'], y = sci_data['FinalGradeCEMS'], marker=".", color = '#88c999')
plt.xlabel('TimeSpent')
plt.ylabel('FinalGradeCEMS')
plt.show()

Well done! As you can see, there appears to be a positive relationship between the time students spend in the online course and their final grade!

4. MODEL

“Model” is one of those terms that has many different meanings. For our purpose, we refer to the process of simplifying and summarizing our data. Thus, models can take many forms; calculating means represents a legitimate form of modeling data, as does estimating more complex models, including linear regressions, and models and algorithms associated with machine learning tasks. For now, we’ll run a base linear regression model to further examine the relationship between TimeSpent and FinalGradeCEMS.

We’ll dive much deeper into modeling in subsequent learning labs, but for now let’s see if there is a statistically significant relationship between students’ final grades, FinaGradeCEMS, and the TimeSpent on the course.

We need to do some transformations of our two parameters before we can run the model, in our case it is linear regression. We encode the data from the two columns FinaGradeCEMS and TimeSpent into the format our scikit-learn linear regression model could understand.

*Note that on the y-axis we put a dependent variable, the one we want to predict and x-axis is for and independent, or exploratory variable.

We then train (or fit) or model to the data we selected to find a line of best fit.

We also added the code to visualize our prediction in the scatter plot and changed its color to understand better what is being predicted.

#dependent variable is what you want to predict - y axis; independent(exploratory) variable
X = np.array(sci_data['TimeSpent']).reshape(-1, 1)
y = np.array(sci_data['FinalGradeCEMS']).reshape(-1, 1)

reg = LinearRegression().fit(X, y)
plt.scatter(X, y, marker=".", color = '#88c999')
plt.plot(X, reg.predict(X),color='hotpink')
plt.xlabel('TimeSpent')
plt.ylabel('FinalGradeCEMS')
plt.show()

We can now use our model to predict the grade depending on the time spent studying.

print(reg.predict([[4000]]))

[[90.13094602]]

#| standalone: true

from shiny import *

app_ui = ui.page_fluid(
    ui.input_slider("n", "N", 0, 100, 40),
    ui.output_text_verbatim("txt"),
)

def server(input, output, session):
    @output
    @render.text
    def txt():
        return f"The value of n*2 is {input.n() * 2}"

app = App(app_ui, server)

Footnotes

We can see that there are totally 603 rows in each column and if we see that non-null count is lower, it means there are missing values.↩︎