In this project, I’ll demonstrate how to use the pandas package in Python for feature extraction. Feature extraction can be described as a method for selecting and/or combining variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the data set.
Here, I will work with data frames stemming from an e-learning platform. The task is to extract meaningful features of user activity from these data frames. These features will then serve as predictors in statistical models for predicting individual differences in cognitive performance. However, as this project specifically concerns the feature extraction process, the statistical analyses will be reported separately in an upcoming project.
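Before turning to the real data, a minimal toy sketch (with made-up column names and values) illustrates what this boils down to in pandas: collapsing many raw event rows into one row of informative features per user.
import pandas as pd
# Hypothetical raw event log: one row per event, many rows per user
events = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'correct': [True, False, True, True, True],
    'seconds': [30, 45, 20, 60, 15],
})
# Collapse the event rows into one row of features per user
features = events.groupby('userId').agg(
    n_events = ('correct', 'size'),
    accuracy = ('correct', 'mean'),
    total_time = ('seconds', 'sum'),
).reset_index()
print(features)
##    userId  n_events  accuracy  total_time
## 0       1         3  0.666667          95
## 1       2         2  1.000000          75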
The feature extraction procedure can briefly be divided into three phases, which I’ll go through step by step in this project: (1) preprocessing and tidying the raw data frames, (2) extracting the online activity features, and (3) merging everything into a single aggregated data frame.
Load the required packages for importing data and preprocessing.
import pandas as pd # For data manipulation
import numpy as np # For data manipulation
from functools import reduce # For merging aggregated data frames together
Next, I’ll load the data frames. The data retrieved from the e-learning platform consists of multiple csv files, namely AnsMat.csv, RevMat.csv, ActMat.csv, and ReadMat.csv. Each csv file contains valuable information about online activities, such as how much time the users have spent studying on the platform and how many quizzes they have answered correctly. The users have also been tested on their cognitive performance, which is found in wm.xlsx.
AnsMat = pd.read_csv('AnsMat.csv', low_memory = False)
RevMat = pd.read_csv('RevMat.csv', low_memory = False)
ActMat = pd.read_csv('ActMat.csv', low_memory = False)
ReadMat = pd.read_csv('ReadMat.csv', low_memory = False)
wm = pd.read_excel('wm.xlsx')
Let’s take a look at the shape of the data frames using the shape attribute:
print(AnsMat.shape)
## (115631, 64)
print(RevMat.shape)
## (143354, 41)
print(ActMat.shape)
## (88442, 39)
print(ReadMat.shape)
## (1326017, 41)
print(wm.shape)
## (272, 13)
As is evident, the data frames comprise quite a few columns, most of which are superfluous and not relevant for capturing meaningful online activities. Thus, the next task is to tidy the data so that the feature extraction process runs more smoothly.
I’ll start by filtering out the relevant variables from RevMat, which I’ll then merge with the AnsMat data frame using the pd.merge command.
# Include only important variables from RevMat
RevMat = RevMat.filter(['userId', 'questionId', 'answerCorrect', 'questionSessionId',
'textMarkerId', 'userReviewDuration', 'userReviewFocus'])
# Merge AnsMat and RevMat
MatComb = pd.merge(AnsMat, RevMat, how = "inner", on=['userId', "questionSessionId", "questionId", 'textMarkerId'])
The eventTime column in the data frames is formatted as unix timestamps in milliseconds. To be able to work efficiently with date times, I’ll create a new column in the data frames using the pd.to_datetime command.
MatComb['datetime'] = pd.to_datetime(MatComb['eventTime'], unit = 'ms')
ActMat['datetime'] = pd.to_datetime(ActMat['eventTime'], unit = 'ms')
ReadMat['datetime'] = pd.to_datetime(ReadMat['eventTime'], unit = 'ms')
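As a quick sanity check with a made-up timestamp: pd.to_datetime with unit = 'ms' interprets the integer as milliseconds since the unix epoch.
print(pd.to_datetime(1577836800000, unit = 'ms'))
## 2020-01-01 00:00:00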
It’s also important to compute different time variables for identifying login frequencies across users. Here, I create a global function titled timeVars for appending the time variables to the data frames. The built-in lambda function is a parsimonious way to retrieve these variables.
def timeVars(data):
    '''This function computes several datetime variables and cut-off values for days spent'''
    data['year'] = data['datetime'].apply(lambda x: x.year)
    data['month'] = data['datetime'].apply(lambda x: x.month)
    data['day'] = data['datetime'].apply(lambda x: x.day)
    data['year_month_day'] = data['datetime'].apply(lambda x: str(x.year) + '-' + str(x.month) + '-' + str(x.day))
    # Time elapsed since each user's first event (assumes rows are in chronological order)
    data['time_diff'] = data['datetime'] - data.groupby('userId')['datetime'].transform('first')
    data['time_diff_sec'] = data['time_diff'].dt.total_seconds()
    # 50 days = 4,320,000 seconds; 100 days = 8,640,000 seconds
    data['cutoff_50days'] = data['time_diff_sec'] <= 4320000
    data['cutoff_100days'] = data['time_diff_sec'] <= 8640000
Then I apply the timeVars function on the data frames.
timeVars(MatComb)
timeVars(ActMat)
timeVars(ReadMat)
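To see what timeVars produces, here is a quick check on a tiny synthetic frame with one made-up user and two events spaced one day apart:
demo = pd.DataFrame({'userId': [1, 1],
                     'datetime': pd.to_datetime(['2020-01-01', '2020-01-02'])})
timeVars(demo)
print(demo[['time_diff_sec', 'cutoff_100days']])
##    time_diff_sec  cutoff_100days
## 0            0.0            True
## 1        86400.0            True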
As there were large inter-individual differences in how long the users had been studying on the platform, I decided to extract only data stemming from the first 100 days of practice for each unique user. Here, I use the query function to implement this operation.
MatComb = MatComb.query('cutoff_100days')
ReadMat = ReadMat.query('cutoff_100days')
ActMat = ActMat.query('cutoff_100days')
I’ll also compute a column depicting how many times a user has logged in to their account (titled reading_sessions_cut_off) and the total number of quizzes the user has taken (titled learning_objects_cut_off). Lastly, I merge these columns into a data frame called Login using pd.merge.
# Retrieve how many login sessions the user has made
ReadingSessions = ReadMat.groupby('userId')['userSessionId'].nunique().reset_index().rename(columns = {'userSessionId':'reading_sessions_cut_off'})
# Retrieve how many learning objects the user has completed
LearningObjects = MatComb.groupby('userId')['textMarkerId'].count().reset_index().rename(columns = {'textMarkerId':'learning_objects_cut_off'})
Login = pd.merge(ReadingSessions, LearningObjects, how = 'left', on = 'userId')
To separate those who only visited the e-learning platform from those who actually used it for studying, I’ll compute two cut-off criteria for inclusion. The first criterion states that users must have completed >= 10 reading_sessions_cut_off during the first 100 days of practice, whereas the second states that users must have completed >= 50 learning_objects_cut_off during the same period. Here, I use the query function to filter out those not reaching these criteria.
Login = Login.query('reading_sessions_cut_off >= 10 and learning_objects_cut_off >= 50')
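For reference, the same filter can be written with plain boolean masks; query is simply a more readable shorthand for this:
Login = Login[(Login['reading_sessions_cut_off'] >= 10) & (Login['learning_objects_cut_off'] >= 50)]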
Altogether, 18 online activity features were extracted from the e-learning platform, heavily based on existing work (e.g., Jovanovic, Vukicevic, Milovanovic, & Minovic, 2012; Tortorella et al., 2015; Zacharis, 2015). Specifically, we divided these features into four broader subdomains of online activity, namely Quiz accuracy, Study frequency, Within-session reading activity, and Repetitive behaviour.
This code retrieves the quiz accuracy variables: the proportion of correctly answered learning objects overall (lo_both_acc), for multiple-choice questions (lo_mc_accuracy), and for text questions (lo_txt_accuracy):
# Both multiple choice and text
learningObjects_both = pd.DataFrame(MatComb.groupby('userId')['answerCorrect']
.value_counts(normalize = True)).rename(columns = {'answerCorrect':'lo_both_acc'}).reset_index().query('answerCorrect').drop(columns = {'answerCorrect'})
# Multiple choice
learningObjects_mc = pd.DataFrame(MatComb.query('questionType == "multiplechoice"').groupby('userId')['answerCorrect']
.value_counts(normalize = True)).rename(columns = {'answerCorrect':'lo_mc_accuracy'}).reset_index().query('answerCorrect').drop(columns = {'answerCorrect'})
# Text
learningObjects_txt = pd.DataFrame(MatComb.query('questionType == "text"').groupby('userId')['answerCorrect']
.value_counts(normalize = True)).rename(columns = {'answerCorrect':'lo_txt_accuracy'}).reset_index().query('answerCorrect').drop(columns = {'answerCorrect'})
# Compute aggregated data frame
learningObjects = (learningObjects_both, learningObjects_mc, learningObjects_txt)
learningObjects_df = reduce(lambda left, right: pd.merge(left, right, how = "left", on = "userId"), learningObjects)
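The value_counts(normalize = True) idiom above is worth unpacking. On a tiny made-up frame it returns, per user, the share of True and False answers; query('answerCorrect') then keeps only the True share. Note that the display below assumes an older pandas, where the resulting series keeps the name answerCorrect; in pandas 2.0 and later it is named proportion instead, which would also affect the rename calls above.
toy = pd.DataFrame({'userId': [1, 1, 1, 2],
                    'answerCorrect': [True, True, False, True]})
print(toy.groupby('userId')['answerCorrect'].value_counts(normalize = True))
## userId  answerCorrect
## 1       True             0.666667
##         False            0.333333
## 2       True             1.000000
## Name: answerCorrect, dtype: float64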
This code retrieves the study frequency variables: the number of reading sessions (reading_sessions), question sessions (question_sessions), completed learning objects (learning_objects), and video clips watched (video_clips):
# Count the total number of reading sessions the students have made
readingSessions = ReadMat.groupby('userId')['userSessionId'].nunique().reset_index().rename(columns = {'userSessionId':'reading_sessions'})
# Count the total number of question sessions the students have made
questionSessions = ReadMat.groupby('userId')['questionSessionId'].nunique().reset_index().rename(columns = {'questionSessionId':'question_sessions'})
# Count the total number of learning objects the students have completed
learningObjects = MatComb.groupby('userId')['textMarkerId'].count().reset_index().rename(columns = {'textMarkerId':"learning_objects"})
# Count the total number of video clips watched
videoClips = ActMat.query('action == "open_video"').groupby("userId")['action'].count().reset_index().rename(columns = {'action':'video_clips'})
# Compute aggregated data frame
studyFrequency = (readingSessions, questionSessions, learningObjects, videoClips)
studyFrequency_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = "userId"), studyFrequency)
This code retrieves the total time variables: the total time spent reading (reading_time), answering (answer_time), reviewing (review_time), and answering plus reviewing combined (question_time):
# Total time spent on reading
readingTime = ReadMat.query('resourceType == "chaptergroup"').groupby('userId')['focusTimeIncrement'].sum().reset_index().rename(columns = {'focusTimeIncrement':'reading_time'})
# Total time a student spent on the question before responding to the learning object
answerTime = MatComb.groupby('userId')['userAnswerFocus'].sum().reset_index().rename(columns = {'userAnswerFocus':'answer_time'})
# Total time a student spent reviewing the correct answers following a learning object
reviewTime = MatComb.groupby('userId')['userReviewFocus'].sum().reset_index().rename(columns = {'userReviewFocus': 'review_time'})
# Total time spent on reviewing and answering the learning objects
MatComb['question_time'] = (MatComb['userAnswerFocus'] + MatComb['userReviewFocus'])
questionTime = MatComb.groupby('userId')['question_time'].sum().reset_index()
# Compute aggregated data frame
readingTime_frequency = (readingTime, answerTime, reviewTime, questionTime)
readingTime_frequency_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = 'userId'), readingTime_frequency)
This code retrieves the within-session reading activity variables, i.e. the average time spent reading (reading_time_avg), answering (answer_time_avg), reviewing (review_time_avg), and answering plus reviewing combined (question_time_avg):
# This set of code retrieves variables capturing average time spent
# Average time spent on reading during a learning session
readingTime_avg = ReadMat.query('resourceType == "chaptergroup"')
readingTime_avg = readingTime_avg.groupby(['userId', 'userSessionId'])['focusTimeIncrement'].sum().reset_index()
readingTime_avg['session_count'] = readingTime_avg.groupby('userId')['userSessionId'].transform('nunique')
readingTime_avg = readingTime_avg.groupby(['userId', 'session_count'])['focusTimeIncrement'].sum().reset_index()
readingTime_avg['reading_time_avg'] = readingTime_avg['focusTimeIncrement']/readingTime_avg['session_count']
readingTime_avg = readingTime_avg.filter(['userId', 'reading_time_avg'])
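# Side note: assuming every session has at least one chaptergroup row, the same
# per-user average can be computed more directly by chaining two groupbys:
# sum the focus time per session, then average those session sums per user.
readingTime_avg_alt = (ReadMat.query('resourceType == "chaptergroup"')
                       .groupby(['userId', 'userSessionId'])['focusTimeIncrement'].sum()
                       .groupby('userId').mean()
                       .reset_index(name = 'reading_time_avg'))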
# Average time a student spent on reading the question before responding to the learning object
answerTime_avg = MatComb.groupby('userId')['userAnswerFocus'].mean().reset_index().rename(columns = {'userAnswerFocus': 'answer_time_avg'})
# Average time a student spent on reviewing the correct answers following a learning object
reviewTime_avg = MatComb.groupby('userId')['userReviewFocus'].mean().reset_index().rename(columns = {'userReviewFocus':'review_time_avg'})
# Average time spent on reviewing and answering the learning objects
MatComb['question_time_avg'] = (MatComb['userAnswerFocus'] + MatComb['userReviewFocus'])
questionTime_avg = MatComb.groupby('userId')['question_time_avg'].mean().reset_index()
# Compute aggregated data frame
withinSessionActivity = (readingTime_avg, questionTime_avg, reviewTime_avg, answerTime_avg)
withinSessionActivity_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = 'userId'), withinSessionActivity)
This code retrieves the repetitive behaviour variables: the average number of repetitions per learning session (repetitions_avg), the repetition count (repetitions_count), and the proportion of repetitions relative to the number of completed learning objects (repetitions_prop):
# Prepare dataset
MatCombRep = MatComb.filter(['userId', 'userSessionId', 'textMarkerId'])
# Count reps across sessions: flag learning objects the user has already completed at any earlier point
MatCombRep['rep_across'] = MatCombRep.duplicated(subset = ['userId', 'textMarkerId']).astype(int)
MatCombRep['rep_across'] = MatCombRep.groupby(["userId", "userSessionId"])['rep_across'].transform("sum")
# Count reps within sessions: flag learning objects repeated inside the same session
MatCombRep['rep_within'] = MatCombRep.duplicated(subset = ['userId', 'userSessionId', 'textMarkerId']).astype(int)
MatCombRep['rep_within'] = MatCombRep.groupby(["userId", "userSessionId"])['rep_within'].transform("sum")
# Average number of repetitions made in a learning session
repetitionsAvg = MatCombRep.groupby(['userId','userSessionId'])['rep_within'].mean().reset_index()
repetitionsAvg['session_count'] = repetitionsAvg.groupby('userId')['userSessionId'].transform('nunique')
repetitionsAvg = repetitionsAvg.groupby(['userId', 'session_count'])['rep_within'].sum().reset_index()
repetitionsAvg['repetitions_avg'] = repetitionsAvg['rep_within']/repetitionsAvg['session_count']
repetitionsAvg = repetitionsAvg.filter(['userId', 'repetitions_avg'])
# Proportion of repetitions in relation to the total number of completed learning objects
repetitionsProp_reps = MatCombRep.groupby('userId')['rep_across'].max().reset_index()
repetitionsProp_LOs = MatCombRep.groupby('userId')['textMarkerId'].count().reset_index().filter(['textMarkerId'])
repetitionsProp = pd.concat([repetitionsProp_reps, repetitionsProp_LOs], axis= 1)
repetitionsProp['repetitions_prop'] = repetitionsProp['rep_across']/repetitionsProp['textMarkerId']
repetitionsProp = repetitionsProp.rename(columns = {'rep_across':'repetitions_count'})
repetitionsProp = repetitionsProp.filter(['userId', 'repetitions_prop','repetitions_count'])
# Compute aggregated data frame
repetitions = (repetitionsAvg, repetitionsProp)
repetitions_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = 'userId'), repetitions)
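The duplicated(subset = ...) idiom used for the repetition counts marks every row whose key combination has already appeared earlier in the frame, as a small made-up example shows:
toy = pd.DataFrame({'userId': [1, 1, 1], 'textMarkerId': [10, 20, 10]})
print(toy.duplicated(subset = ['userId', 'textMarkerId']))
## 0    False
## 1    False
## 2     True
## dtype: bool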
The last task is to merge all unique data frames into a single aggregated data frame. I’ll start with the wm.xlsx file, from which I select the important variables, including the userId key for joining the cognitive performance measures to the online activities.
wm_tasks = wm.filter(['userId', 'rutan_level', 'digitspan_level', 'ospan_letter_list_correct'])
Next, I merge all data frames with each other using a left join. The reduce function from the functools module makes it very simple to merge multiple data frames in one go.
dfs = (Login, learningObjects_df, studyFrequency_df, readingTime_frequency_df, withinSessionActivity_df, repetitions_df, wm_tasks)
df = reduce(lambda left, right: pd.merge(left, right, how = "left", on = 'userId'), dfs)
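Since the joins are left joins, users missing from one of the feature frames end up with NaN values (visible in the head output below). A quick per-column overview of missingness can be obtained before modelling:
print(df.isna().sum())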
Lastly, we print the head of the final data frame, which will be used for further analysis.
pd.set_option('display.max_columns', 500)
df.head()
## userId reading_sessions_cut_off learning_objects_cut_off lo_both_acc \
## 0 642 12 88.0 0.784091
## 1 643 18 376.0 0.492021
## 2 677 73 640.0 0.596875
## 3 697 15 73.0 0.835616
## 4 743 31 769.0 0.810143
##
## lo_mc_accuracy lo_txt_accuracy reading_sessions question_sessions \
## 0 0.756757 0.803922 12 0
## 1 0.785714 0.455090 18 33
## 2 0.818182 0.580537 73 65
## 3 NaN 0.835616 15 1
## 4 0.758621 0.814346 31 26
##
## learning_objects video_clips reading_time answer_time review_time \
## 0 88.0 1.0 961513.0 3285498.0 1085923.0
## 1 376.0 5.0 16925035.0 10070643.0 5904319.0
## 2 640.0 1.0 63540804.0 16388984.0 5224244.0
## 3 73.0 NaN 810566.0 1648349.0 423585.0
## 4 769.0 6.0 14978968.0 18436372.0 6900600.0
##
## question_time reading_time_avg question_time_avg review_time_avg \
## 0 4371421.0 9.615130e+04 49675.238636 12340.034091
## 1 15974962.0 1.128336e+06 42486.601064 15702.976064
## 2 21613228.0 9.483702e+05 33770.668750 8162.881250
## 3 2071934.0 5.789757e+04 28382.657534 5802.534247
## 4 25336972.0 6.808622e+05 32947.947984 8973.472042
##
## answer_time_avg repetitions_avg repetitions_prop repetitions_count \
## 0 37335.204545 2.400000 0.181818 16.0
## 1 26783.625000 0.500000 0.077128 29.0
## 2 25607.787500 0.542857 0.073438 47.0
## 3 22580.123288 NaN NaN NaN
## 4 23974.475943 0.000000 0.063719 49.0
##
## rutan_level digitspan_level ospan_letter_list_correct
## 0 8 NaN NaN
## 1 6 4.0 48.0
## 2 7 NaN NaN
## 3 7 9.0 75.0
## 4 9 9.0 69.0
I hope that this tutorial gave you some insights into the benefits of extracting features into aggregated data frames using the pandas package in Python.