In this project, I’ll demonstrate how to use the pandas package in Python for feature extraction. Feature extraction can be described as a method for selecting and/or combining variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the data set.
Here, I will work with data frames stemming from an e-learning platform. The task is to extract meaningful features of user activity from these data frames. These features will then serve as predictors in statistical models for predicting individual differences in cognitive performance. However, as this project specifically concerns the feature extraction process, the statistical analyses will be reported separately in an upcoming project.
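Before turning to the real data, a minimal toy sketch (with made-up column names and values) illustrates what this boils down to in pandas: collapsing many raw event rows into one row of informative features per user.
import pandas as pd
# Hypothetical raw event log: one row per event, many rows per user
events = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'correct': [True, False, True, True, True],
    'seconds': [30, 45, 20, 60, 15],
})
# Collapse the event rows into one row of features per user
features = events.groupby('userId').agg(
    n_events = ('correct', 'size'),
    accuracy = ('correct', 'mean'),
    total_time = ('seconds', 'sum'),
).reset_index()
print(features)
##    userId  n_events  accuracy  total_time
## 0       1         3  0.666667          95
## 1       2         2  1.000000          75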
The feature extraction procedure can briefly be divided into three phases, which I’ll go through step by step in this project: (1) preprocessing and tidying the raw data frames, (2) extracting the online activity features, and (3) merging everything into a single aggregated data frame.
Load the required packages for importing data and preprocessing.
import pandas as pd # For data manipulation
import numpy as np # For data manipulation
from functools import reduce # For merging aggregated data frames together
Next, I’ll load the data frames. The data retrieved from the e-learning platform consists of multiple csv files, namely AnsMat.csv, RevMat.csv, ActMat.csv, and ReadMat.csv. Each csv file contains valuable information about online activities, such as how much time the users have spent studying on the platform and how many quizzes they have answered correctly. The users have also been tested on their cognitive performance, which is found in wm.xlsx.
AnsMat = pd.read_csv('AnsMat.csv', low_memory = False)
RevMat = pd.read_csv('RevMat.csv', low_memory = False)
ActMat = pd.read_csv('ActMat.csv', low_memory = False)
ReadMat = pd.read_csv('ReadMat.csv', low_memory = False)
wm = pd.read_excel('wm.xlsx')
Let’s take a look at the shape of the data frames using the shape attribute:
print(AnsMat.shape)
## (115631, 64)
print(RevMat.shape)
## (143354, 41)
print(ActMat.shape)
## (88442, 39)
print(ReadMat.shape)
## (1326017, 41)
print(wm.shape)
## (272, 13)
As is evident, the data frames comprise quite a few columns, most of which are superfluous and not relevant for capturing meaningful online activities. Thus, the next task is to tidy the data so that the feature extraction process runs more smoothly.
I’ll start by filtering out the relevant variables from RevMat, which I’ll then merge with the AnsMat data frame using the pd.merge command.
# Include only important variables from RevMat
RevMat = RevMat.filter(['userId', 'questionId', 'answerCorrect', 'questionSessionId',
'textMarkerId', 'userReviewDuration', 'userReviewFocus'])
# Merge AnsMat and RevMat
MatComb = pd.merge(AnsMat, RevMat, how = "inner", on=['userId', "questionSessionId", "questionId", 'textMarkerId'])
The eventTime column in the data frames is formatted as unix timestamps in milliseconds. To be able to work efficiently with date times, I’ll create a new column in the data frames using the pd.to_datetime command.
MatComb['datetime'] = pd.to_datetime(MatComb['eventTime'], unit = 'ms')
ActMat['datetime'] = pd.to_datetime(ActMat['eventTime'], unit = 'ms')
ReadMat['datetime'] = pd.to_datetime(ReadMat['eventTime'], unit = 'ms')
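As a quick sanity check with a made-up timestamp: pd.to_datetime with unit = 'ms' interprets the integer as milliseconds since the unix epoch.
print(pd.to_datetime(1577836800000, unit = 'ms'))
## 2020-01-01 00:00:00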
It’s also important to compute different time variables for identifying login frequencies across users. Here, I create a global function titled timeVars for appending the time variables to the data frames. The built-in lambda function is a parsimonious way to retrieve these variables.
def timeVars(data):
    '''This function computes several datetime variables and cut-off values for days spent'''
    data['year'] = data['datetime'].apply(lambda x: x.year)
    data['month'] = data['datetime'].apply(lambda x: x.month)
    data['day'] = data['datetime'].apply(lambda x: x.day)
    data['year_month_day'] = data['datetime'].apply(lambda x: str(x.year) + '-' + str(x.month) + '-' + str(x.day))
    # Time elapsed since each user's first event (assumes rows are in chronological order)
    data['time_diff'] = data['datetime'] - data.groupby('userId')['datetime'].transform('first')
    data['time_diff_sec'] = data['time_diff'].dt.total_seconds()
    # 50 days = 4,320,000 seconds; 100 days = 8,640,000 seconds
    data['cutoff_50days'] = data['time_diff_sec'] <= 4320000
    data['cutoff_100days'] = data['time_diff_sec'] <= 8640000
Then I apply the timeVars function on the data frames.
timeVars(MatComb)
timeVars(ActMat)
timeVars(ReadMat)
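To see what timeVars produces, here is a quick check on a tiny synthetic frame with one made-up user and two events spaced one day apart:
demo = pd.DataFrame({'userId': [1, 1],
                     'datetime': pd.to_datetime(['2020-01-01', '2020-01-02'])})
timeVars(demo)
print(demo[['time_diff_sec', 'cutoff_100days']])
##    time_diff_sec  cutoff_100days
## 0            0.0            True
## 1        86400.0            True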
As there were large inter-individual differences in how long the users had been studying on the platform, I decided to extract only data stemming from the first 100 days of practice for each unique user. Here, I use the query function to implement this operation.
MatComb = MatComb.query('cutoff_100days')
ReadMat = ReadMat.query('cutoff_100days')
ActMat = ActMat.query('cutoff_100days')
I’ll also compute a column depicting how many times a user has logged in to their account (titled reading_sessions_cut_off) and the total number of quizzes the user has taken (titled learning_objects_cut_off). Lastly, I merge these columns into a data frame called Login using pd.merge.
# Retrieve how many login sessions the user has made
ReadingSessions = ReadMat.groupby('userId')['userSessionId'].nunique().reset_index().rename(columns = {'userSessionId':'reading_sessions_cut_off'})
# Retrieve how many learning objects the user has completed
LearningObjects = MatComb.groupby('userId')['textMarkerId'].count().reset_index().rename(columns = {'textMarkerId':'learning_objects_cut_off'})
Login = pd.merge(ReadingSessions, LearningObjects, how = 'left', on = 'userId')
To separate those who only visited the e-learning platform from those who actually used it for studying, I’ll compute two cut-off criteria for inclusion. The first criterion states that users must have completed >= 10 reading_sessions_cut_off during the first 100 days of practice, whereas the second states that users must have completed >= 50 learning_objects_cut_off during the same period. Here, I use the query function to filter out those not reaching these criteria.
Login = Login.query('reading_sessions_cut_off >= 10 and learning_objects_cut_off >= 50')
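For reference, the same filter can be written with plain boolean masks; query is simply a more readable shorthand for this:
Login = Login[(Login['reading_sessions_cut_off'] >= 10) & (Login['learning_objects_cut_off'] >= 50)]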
Altogether, 18 online activity features were extracted from the e-learning platform, heavily based on existing work (e.g., Jovanovic, Vukicevic, Milovanovic, & Minovic, 2012; Tortorella et al., 2015; Zacharis, 2015). Specifically, we divided these features into four broader subdomains of online activity, namely Quiz accuracy, Study frequency, Within-session reading activity, and Repetitive behaviour.
This code retrieves the quiz accuracy variables: the proportion of correctly answered learning objects overall (lo_both_acc), for multiple-choice questions (lo_mc_accuracy), and for text questions (lo_txt_accuracy):
# Both multiple choice and text
learningObjects_both = pd.DataFrame(MatComb.groupby('userId')['answerCorrect']
.value_counts(normalize = True)).rename(columns = {'answerCorrect':'lo_both_acc'}).reset_index().query('answerCorrect').drop(columns = {'answerCorrect'})
# Multiple choice
learningObjects_mc = pd.DataFrame(MatComb.query('questionType == "multiplechoice"').groupby('userId')['answerCorrect']
.value_counts(normalize = True)).rename(columns = {'answerCorrect':'lo_mc_accuracy'}).reset_index().query('answerCorrect').drop(columns = {'answerCorrect'})
# Text
learningObjects_txt = pd.DataFrame(MatComb.query('questionType == "text"').groupby('userId')['answerCorrect']
.value_counts(normalize = True)).rename(columns = {'answerCorrect':'lo_txt_accuracy'}).reset_index().query('answerCorrect').drop(columns = {'answerCorrect'})
# Compute aggregated data frame
learningObjects = (learningObjects_both, learningObjects_mc, learningObjects_txt)
learningObjects_df = reduce(lambda left, right: pd.merge(left, right, how = "left", on = "userId"), learningObjects)
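The value_counts(normalize = True) idiom above is worth unpacking. On a tiny made-up frame it returns, per user, the share of True and False answers; query('answerCorrect') then keeps only the True share. Note that the display below assumes an older pandas, where the resulting series keeps the name answerCorrect; in pandas 2.0 and later it is named proportion instead, which would also affect the rename calls above.
toy = pd.DataFrame({'userId': [1, 1, 1, 2],
                    'answerCorrect': [True, True, False, True]})
print(toy.groupby('userId')['answerCorrect'].value_counts(normalize = True))
## userId  answerCorrect
## 1       True             0.666667
##         False            0.333333
## 2       True             1.000000
## Name: answerCorrect, dtype: float64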
This code retrieves the study frequency variables: the number of reading sessions (reading_sessions), question sessions (question_sessions), completed learning objects (learning_objects), and video clips watched (video_clips):
# Count the total number of reading sessions the students have made
readingSessions = ReadMat.groupby('userId')['userSessionId'].nunique().reset_index().rename(columns = {'userSessionId':'reading_sessions'})
# Count the total number of question sessions the students have made
questionSessions = ReadMat.groupby('userId')['questionSessionId'].nunique().reset_index().rename(columns = {'questionSessionId':'question_sessions'})
# Count the total number of learning objects the students have completed
learningObjects = MatComb.groupby('userId')['textMarkerId'].count().reset_index().rename(columns = {'textMarkerId':"learning_objects"})
# Count the total number of video clips watched
videoClips = ActMat.query('action == "open_video"').groupby("userId")['action'].count().reset_index().rename(columns = {'action':'video_clips'})
# Compute aggregated data frame
studyFrequency = (readingSessions, questionSessions, learningObjects, videoClips)
studyFrequency_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = "userId"), studyFrequency)
This code retrieves the total time variables: the total time spent reading (reading_time), answering (answer_time), reviewing (review_time), and answering plus reviewing combined (question_time):
# Total time spent on reading
readingTime = ReadMat.query('resourceType == "chaptergroup"').groupby('userId')['focusTimeIncrement'].sum().reset_index().rename(columns = {'focusTimeIncrement':'reading_time'})
# Total time a student spent on the question before responding to the learning object
answerTime = MatComb.groupby('userId')['userAnswerFocus'].sum().reset_index().rename(columns = {'userAnswerFocus':'answer_time'})
# Total time a student spent reviewing the correct answers following a learning object
reviewTime = MatComb.groupby('userId')['userReviewFocus'].sum().reset_index().rename(columns = {'userReviewFocus': 'review_time'})
# Total time spent on reviewing and answering the learning objects
MatComb['question_time'] = (MatComb['userAnswerFocus'] + MatComb['userReviewFocus'])
questionTime = MatComb.groupby('userId')['question_time'].sum().reset_index()
# Compute aggregated data frame
readingTime_frequency = (readingTime, answerTime, reviewTime, questionTime)
readingTime_frequency_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = 'userId'), readingTime_frequency)
This code retrieves the within-session reading activity variables, i.e. the average time spent reading (reading_time_avg), answering (answer_time_avg), reviewing (review_time_avg), and answering plus reviewing combined (question_time_avg):
# This set of code retrieves variables capturing average time spent
# Average time spent on reading during a learning session
readingTime_avg = ReadMat.query('resourceType == "chaptergroup"')
readingTime_avg = readingTime_avg.groupby(['userId', 'userSessionId'])['focusTimeIncrement'].sum().reset_index()
readingTime_avg['session_count'] = readingTime_avg.groupby('userId')['userSessionId'].transform('nunique')
readingTime_avg = readingTime_avg.groupby(['userId', 'session_count'])['focusTimeIncrement'].sum().reset_index()
readingTime_avg['reading_time_avg'] = readingTime_avg['focusTimeIncrement']/readingTime_avg['session_count']
readingTime_avg = readingTime_avg.filter(['userId', 'reading_time_avg'])
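# Side note: assuming every session has at least one chaptergroup row, the same
# per-user average can be computed more directly by chaining two groupbys:
# sum the focus time per session, then average those session sums per user.
readingTime_avg_alt = (ReadMat.query('resourceType == "chaptergroup"')
                       .groupby(['userId', 'userSessionId'])['focusTimeIncrement'].sum()
                       .groupby('userId').mean()
                       .reset_index(name = 'reading_time_avg'))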
# Average time a student spent on reading the question before responding to the learning object
answerTime_avg = MatComb.groupby('userId')['userAnswerFocus'].mean().reset_index().rename(columns = {'userAnswerFocus': 'answer_time_avg'})
# Average time a student spent on reviewing the correct answers following a learning object
reviewTime_avg = MatComb.groupby('userId')['userReviewFocus'].mean().reset_index().rename(columns = {'userReviewFocus':'review_time_avg'})
# Average time spent on reviewing and answering the learning objects
MatComb['question_time_avg'] = (MatComb['userAnswerFocus'] + MatComb['userReviewFocus'])
questionTime_avg = MatComb.groupby('userId')['question_time_avg'].mean().reset_index()
# Compute aggregated data frame
withinSessionActivity = (readingTime_avg, questionTime_avg, reviewTime_avg, answerTime_avg)
withinSessionActivity_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = 'userId'), withinSessionActivity)
This code retrieves the repetitive behaviour variables: the average number of repetitions per learning session (repetitions_avg), the repetition count (repetitions_count), and the proportion of repetitions relative to the number of completed learning objects (repetitions_prop):
# Prepare dataset
MatCombRep = MatComb.filter(['userId', 'userSessionId', 'textMarkerId'])
# Count reps across sessions: flag learning objects the user has already completed at any earlier point
MatCombRep['rep_across'] = MatCombRep.duplicated(subset = ['userId', 'textMarkerId']).astype(int)
MatCombRep['rep_across'] = MatCombRep.groupby(["userId", "userSessionId"])['rep_across'].transform("sum")
# Count reps within sessions: flag learning objects repeated inside the same session
MatCombRep['rep_within'] = MatCombRep.duplicated(subset = ['userId', 'userSessionId', 'textMarkerId']).astype(int)
MatCombRep['rep_within'] = MatCombRep.groupby(["userId", "userSessionId"])['rep_within'].transform("sum")
# Average number of repetitions made in a learning session
repetitionsAvg = MatCombRep.groupby(['userId','userSessionId'])['rep_within'].mean().reset_index()
repetitionsAvg['session_count'] = repetitionsAvg.groupby('userId')['userSessionId'].transform('nunique')
repetitionsAvg = repetitionsAvg.groupby(['userId', 'session_count'])['rep_within'].sum().reset_index()
repetitionsAvg['repetitions_avg'] = repetitionsAvg['rep_within']/repetitionsAvg['session_count']
repetitionsAvg = repetitionsAvg.filter(['userId', 'repetitions_avg'])
# Proportion of repetitions in relation to the total number of completed learning objects
repetitionsProp_reps = MatCombRep.groupby('userId')['rep_across'].max().reset_index()
repetitionsProp_LOs = MatCombRep.groupby('userId')['textMarkerId'].count().reset_index().filter(['textMarkerId'])
repetitionsProp = pd.concat([repetitionsProp_reps, repetitionsProp_LOs], axis= 1)
repetitionsProp['repetitions_prop'] = repetitionsProp['rep_across']/repetitionsProp['textMarkerId']
repetitionsProp = repetitionsProp.rename(columns = {'rep_across':'repetitions_count'})
repetitionsProp = repetitionsProp.filter(['userId', 'repetitions_prop','repetitions_count'])
# Compute aggregated data frame
repetitions = (repetitionsAvg, repetitionsProp)
repetitions_df = reduce(lambda left, right: pd.merge(left, right, how = 'left', on = 'userId'), repetitions)
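The duplicated(subset = ...) idiom used for the repetition counts marks every row whose key combination has already appeared earlier in the frame, as a small made-up example shows:
toy = pd.DataFrame({'userId': [1, 1, 1], 'textMarkerId': [10, 20, 10]})
print(toy.duplicated(subset = ['userId', 'textMarkerId']))
## 0    False
## 1    False
## 2     True
## dtype: bool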
The last task is to merge all unique data frames into a single aggregated data frame. I’ll start with the wm.xlsx file, from which I select the important variables, including the userId key for joining the cognitive performance measures to the online activities.
wm_tasks = wm.filter(['userId', 'rutan_level', 'digitspan_level', 'ospan_letter_list_correct'])
Next, I merge all data frames with each other using a left join. The reduce function from the functools module makes it very simple to merge multiple data frames in one go.
dfs = (Login, learningObjects_df, studyFrequency_df, readingTime_frequency_df, withinSessionActivity_df, repetitions_df, wm_tasks)
df = reduce(lambda left, right: pd.merge(left, right, how = "left", on = 'userId'), dfs)
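Since the joins are left joins, users missing from one of the feature frames end up with NaN values (visible in the head output below). A quick per-column overview of missingness can be obtained before modelling:
print(df.isna().sum())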
Lastly, we print the head of the final data frame, which will be used for further analysis.
pd.set_option('display.max_columns', 500)
df.head()
## userId reading_sessions_cut_off learning_objects_cut_off lo_both_acc \
## 0 642 12 88.0 0.784091
## 1 643 18 376.0 0.492021
## 2 677 73 640.0 0.596875
## 3 697 15 73.0 0.835616
## 4 743 31 769.0 0.810143
##
## lo_mc_accuracy lo_txt_accuracy reading_sessions question_sessions \
## 0 0.756757 0.803922 12 0
## 1 0.785714 0.455090 18 33
## 2 0.818182 0.580537 73 65
## 3 NaN 0.835616 15 1
## 4 0.758621 0.814346 31 26
##
## learning_objects video_clips reading_time answer_time review_time \
## 0 88.0 1.0 961513.0 3285498.0 1085923.0
## 1 376.0 5.0 16925035.0 10070643.0 5904319.0
## 2 640.0 1.0 63540804.0 16388984.0 5224244.0
## 3 73.0 NaN 810566.0 1648349.0 423585.0
## 4 769.0 6.0 14978968.0 18436372.0 6900600.0
##
## question_time reading_time_avg question_time_avg review_time_avg \
## 0 4371421.0 9.615130e+04 49675.238636 12340.034091
## 1 15974962.0 1.128336e+06 42486.601064 15702.976064
## 2 21613228.0 9.483702e+05 33770.668750 8162.881250
## 3 2071934.0 5.789757e+04 28382.657534 5802.534247
## 4 25336972.0 6.808622e+05 32947.947984 8973.472042
##
## answer_time_avg repetitions_avg repetitions_prop repetitions_count \
## 0 37335.204545 2.400000 0.181818 16.0
## 1 26783.625000 0.500000 0.077128 29.0
## 2 25607.787500 0.542857 0.073438 47.0
## 3 22580.123288 NaN NaN NaN
## 4 23974.475943 0.000000 0.063719 49.0
##
## rutan_level digitspan_level ospan_letter_list_correct
## 0 8 NaN NaN
## 1 6 4.0 48.0
## 2 7 NaN NaN
## 3 7 9.0 75.0
## 4 9 9.0 69.0
I hope that this tutorial gave you some insights into the benefits of extracting features into aggregated data frames using the pandas package in Python.