Jitter
Introduction and Basics
Introduction
The goal of this project was to gain insight into the jitter present in each message in a log, in order to find out how often a message is sent on time relative to its expected period. Jitter is defined as “the change in time it takes for a data packet to travel across a network (Dialpad).” For example, a message can be made up of 4 data packets. Ideally, the 4 data packets arrive at their destination at the same time, but this does not always happen due to network disruptions. Instead, they end up arriving at irregular intervals, and this variation is the jitter. Moreover, the jitter period is the maximum deviation of any of a message’s data packets from their mean arrival time (NXP).
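To make that definition concrete, below is a minimal sketch, using made-up arrival times rather than values from the data, of how the jitter period of a single message would be computed under this definition:
#illustrative example: jitter period of one message with made-up packet arrival times (ms)
arrivals = [10.0, 12.5, 11.0, 18.0]
mean_arrival = sum(arrivals) / len(arrivals)            # 12.875 ms
deviations = [abs(t - mean_arrival) for t in arrivals]  # [2.875, 0.375, 1.875, 5.125]
jitter_period = max(deviations)                         # 5.125 ms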
The two datasets used in this project were:
- logdata.csv: a csv file with two columns, message_id and timestamp (ms). The column “message_id” is the unique identifier for a message, and the column “timestamp (ms)” records the time in milliseconds at which that observation arrived at its destination.
- periods.csv: a csv file with two columns, message_id and period. The column “message_id” is the unique identifier for a message, and the column “period” is the expected jitter period, measured in milliseconds (ms).
Loading data and packages
#r packages
library(ggplot2)
library(reticulate)
#python packages
import pandas as pd
#loading data
log_data = pd.read_csv("logdata.csv")
periods_data = pd.read_csv("periods.csv")
Understanding the data
#periods data
#checking dimensions
periods_data.shape
(100, 2)
#checking number of unique message_id
len(pd.unique(periods_data['message_id']))
100
#sorting by period
periods_data = periods_data.sort_values(by = 'period')
periods_data.head()
    message_id  period
94 1426 0
0 49 10
63 880 10
67 908 10
68 909 10
The dataset had 100 rows and 2 columns, and the number of distinct message_id values was 100. This means that message_id does not repeat in the dataset. There is no information about the unit of period, so it was assumed to be ms (milliseconds), since the timestamps in the “log_data” dataset are measured in ms.
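That uniqueness claim can also be verified directly; this one-line check is an illustrative addition, not part of the original output:
#sanity check: every message_id appears exactly once in periods_data
periods_data['message_id'].is_unique  # True, since all 100 ids are distinct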
#log_data
#checking dimensions
log_data.shape
(350243, 2)
#checking number of unique message_id
len(pd.unique(log_data['message_id']))
102
#sorting by timestamp
log_data = log_data.sort_values(by = 'timestamp (ms)')
log_data.head()
   timestamp (ms)  message_id
0 0.243 341
1 0.329 356
2 0.585 389
3 0.841 390
4 0.969 907
The dataset had 350,243 rows and 2 columns, and the number of distinct message_id values was 102. This means that message_id values repeat many times in the dataset. The timestamps are in ms (milliseconds). Because message_ids repeat, it was assumed that the setup of the study was that each occurrence of a given message_id represents one of the individual data packets that make up that message. Sorting the dataset by timestamp showed that the entries are valid, since they begin at 0.243 ms.
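Under that assumption, the packet count of each message is simply the number of times its message_id repeats. The snippet below is an illustrative preview of those counts; the full per-message counts are computed in the EDA:
#preview of packets per message: each repetition of a message_id is one data packet
log_data['message_id'].value_counts().head()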
EDA
Periods data
The histogram below shows that the periods range from 0 to 1,000 ms, with the most common period being 100 ms.
#histogram to visualize distribution of period
#python objects are accessed from R via reticulate's py$
ggplot(py$periods_data, aes(x = period)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(breaks = seq(0, 1000, 100)) +
  labs(x = "period (ms)", title = "Histogram of the jitter period expected for a message arrival") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
Log data
In the original dataset, the mean arrival time across all packets of all messages was 52,570.47 milliseconds and the median was 52,569.51. Moreover, two sub-datasets were created in which the mean and the median arrival time across all data packets of a given message were calculated. The averages of those per-message statistics were very similar to the overall figures, at 53,059.25 and 53,058.25 milliseconds respectively.
#original data
log_data.describe()['timestamp (ms)']
count 350243.000000
mean 52570.470550
std 30350.411599
min 0.243000
25% 26287.875000
50% 52569.507000
75% 78854.328000
max 105136.360000
Name: timestamp (ms), dtype: float64
#log_data grouped by message_id: mean
log_data_mean = log_data.groupby(['message_id']).mean()
#log_data grouped by message_id: median
log_data_median = log_data.groupby(['message_id']).median()
#merging mean and median
log_data_merged = pd.merge(log_data_mean, log_data_median, on="message_id")
log_data_merged.reset_index(inplace=True)
log_data_merged.columns = ['message_id','mean','median']
log_data_merged[['mean','median']].describe()
               mean        median
count 102.000000 102.000000
mean 53059.249968 53058.246819
std 4980.251510 4980.502892
min 52314.690314 52056.448000
25% 52563.454776 52562.914750
50% 52569.366662 52569.592250
75% 52584.622755 52587.448000
max 102855.610250 102855.599500
#number of data packets per message
log_data_packets = log_data.groupby(['message_id']).count()
log_data_packets.reset_index(inplace=True)
log_data_packets.columns = ['message_id','number_packets']
Moreover, as shown in the plot below, most messages are made up of fewer than 2,500 data packets, with the highest packet count being just over 10,500 and the lowest being 4.
ggplot(py$log_data_packets, aes(x = number_packets)) +
  geom_density(fill = "#192841", alpha = 0.5) +
  theme_minimal() +
  labs(x = "number of data packets",
       title = "Density of the number of data packets per message") +
  theme(plot.title = element_text(face = "bold"))
The dataset “log_data_general_metrics” was created to bring together, for each message, the mean and median time it takes for its data packets to arrive at their destination, its number of packets, and its expected jitter period.
#putting general metrics about a particular message together
log_data_metrics_packets = log_data_merged.merge(log_data_packets,
on='message_id',
how='left')
log_data_general_metrics = log_data_metrics_packets.merge(periods_data,
on = "message_id",
how = "left")
log_data_general_metrics.head()
   message_id          mean      median  number_packets  period
0 49 52567.447523 52567.6510 10514 10.0
1 81 52568.711934 52568.7600 10514 10.0
2 257 52570.002043 52570.3815 10514 10.0
3 258 52576.545720 52576.7810 1052 100.0
4 259 52534.996414 52535.5020 1051 100.0
Three columns were added to the dataset “log_period_general” to hold:
- difference: the difference between a data packet’s arrival time and the mean arrival time of all data packets of its message. Its absolute value is the actual period, but the raw signed difference was kept so that the number of data packets arriving earlier or later than their mean could be analyzed later.
- actual_period: how much earlier or later than the mean arrival time of its message’s data packets a specific data packet arrived. It is important to highlight why actual_period is expressed as an absolute value: contrary to what one might believe at first, a data packet arriving much earlier than the mean is no better than one arriving much later, since both create significant gaps between packet arrival times, and those gaps are what disrupt communication.
- off: the difference between the actual period and the expected period of the data packet. An off value of zero or less is a good sign, since it means the data packet arrived within the expected period (e.g. if a packet’s actual period is 5 and its expected period is 10, it arrived within expectations and its off value is -5).
#merging log_data_general_metrics with the original log_data
log_period_general = log_data.merge(log_data_general_metrics,
on = "message_id",
how = "left").sort_values(by = 'message_id')
#reordering columns
log_period_general = log_period_general.reindex(columns=['message_id','timestamp (ms)', 'mean',
'median','number_packets', 'period'])
#create column to hold the difference of the arrival of each data packet relative to the mean
#a positive difference means that the packet arrived later than the mean of its message's data packets
log_period_general['difference'] = log_period_general['timestamp (ms)'] - log_period_general['mean']
#create column to hold the actual jitter period of each data packet
log_period_general['actual_period'] = abs(log_period_general['difference'])
#create column to hold how off the actual period is from the expected period
log_period_general['off'] = log_period_general['actual_period'] - log_period_general['period']
#overview of dataset
log_period_general.head()
        message_id  timestamp (ms)  ...  actual_period            off
219511 49 65892.960 ... 13325.512477 13315.512477
117229 49 35192.227 ... 17375.220523 17365.220523
243916 49 73222.964 ... 20655.516477 20645.516477
182320 49 54732.758 ... 2165.310477 2155.310477
336231 49 100933.020 ... 48365.572477 48355.572477
[5 rows x 9 columns]
There were 207 data packets that were delivered on time. Given that there were 350,243 data packets in total across all messages, only ~0.06% of data packets arrived within the expected jitter period. Those data packets belong to 99 of the 102 messages, meaning that ~97% of the messages in the log had at least one of their data packets arrive on time.
#subset of observations within expectation
within_expectation = log_period_general[log_period_general['off'] <= 0]
#number of observations within expectation
within_expectation.shape[0]
207
#number of unique messages in the subset of observations within expectation
len(within_expectation['message_id'].unique())
99
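As a sanity check on those figures, both shares can be computed directly from the objects above (a minimal sketch: 207/350,243 ≈ 0.0006 and 99/102 ≈ 0.97):
#share of data packets arriving within the expected period
within_expectation.shape[0] / log_period_general.shape[0]  # ~0.0006, i.e. ~0.06%
#share of messages with at least one on-time packet
len(within_expectation['message_id'].unique()) / log_period_general['message_id'].nunique()  # 99/102, ~97%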
A dataset containing only those messages that had at least one packet arrive within the expected jitter period was created, with two columns: the number of packets that arrived on time and the total number of packets that made up the message. As the table below shows, among messages with at least one on-time packet, on average only ~0.4% of a message’s packets arrived on time, with the median at ~0.2%. In the best case, ~2.9% of a message’s packets arrived within the period.
#getting the message_ids that had at least one packet on time
within_expectation_id = within_expectation.iloc[:,0:1]
#counting how many packets of a given message arrived on time
within_expectation_packets = within_expectation_id.groupby('message_id').size().reset_index()
within_expectation_packets.columns = ['message_id', 'packets_within_period']
#merging it to get the column of their total number of packets
within_expectation_packets = within_expectation_packets.merge(log_data_packets,
on = 'message_id',
how = 'left')
#getting the ratio of number of packets on time and total number of packets
within_expectation_packets['ratio_packets'] = within_expectation_packets['packets_within_period']/within_expectation_packets['number_packets']
#retrieving metrics
within_expectation_packets.describe()
       message_id  packets_within_period  number_packets  ratio_packets
count 99.000000 99.000000 99.000000 99.000000
mean 725.232323 2.090909 3325.363636 0.004211
std 384.387714 0.656069 3769.657941 0.006678
min 49.000000 1.000000 105.000000 0.000190
25% 389.500000 2.000000 1051.000000 0.000380
50% 680.000000 2.000000 1052.000000 0.001901
75% 924.000000 2.000000 5257.000000 0.001903
max 1869.000000 8.000000 10515.000000 0.028571
Given those results, one question was whether there is any relationship between the actual jitter period and the expected jitter period. The boxplot below shows that there is not. The only exception is when the expected jitter period was 0: in that case the actual period was low as well, since all of the message’s data packets arrived at almost the same time (early and late arrivals canceled out around the mean).
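That exception can also be checked numerically before plotting; the snippet below is an illustrative check, not part of the original output:
#actual periods of the packets belonging to the message(s) with expected period 0
log_period_general[log_period_general['period'] == 0]['actual_period'].describe()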
#period treated as a factor so each expected period gets its own box
ggplot(py$log_period_general, aes(x = factor(period),
                                  y = actual_period)) +
  geom_boxplot(fill = "#192841", alpha = 0.5) +
  theme_minimal() +
  labs(x = 'expected period (ms)',
       y = 'actual period (ms)',
       title = 'Actual jitter period vs expected jitter period') +
  theme(plot.title = element_text(face = "bold"))
Finally, given these striking results, it was worth asking whether there is any correlation between the number of data packets in a message and its average actual period. This query was relevant because, if such a correlation existed, it would suggest that the statement “the more data packets a message has, the higher its jitter period” might be true. If that were proven, a proposed solution could be to allocate a larger expected jitter period to messages with more data packets, so that the gap between the expected and actual jitter period shrinks.
To test this, the average actual jitter period per message was calculated, and then its correlation with the message’s number of data packets was computed. As shown below, the correlation coefficient did not even reach 0.1, so the hypothesis was discarded.
#extracting relevant columns
log_period_general_actual = log_period_general[['message_id', 'actual_period']]
#getting the mean of the actual jitter period for a message
log_period_general_actual = log_period_general_actual.groupby(['message_id']).mean()
log_period_general_actual.reset_index(inplace=True)
#merging it with the total number of packets
log_period_general_actual = log_period_general_actual.merge(log_data_packets, on = 'message_id', how = 'left')
#computing the correlation
pearson_correlation = log_period_general_actual['actual_period'].corr(log_period_general_actual['number_packets'])
pearson_correlation
0.09057550188040804
Conclusions
The study reached the following conclusions:
Only ~0.06% of data packets arrived within the expected jitter period.
~97% of messages in the log (99 of 102) had at least one of their data packets arrive on time.
The maximum percentage of a message’s data packets that arrived on time was ~2.9%; the mean was ~0.4%. This means that even though ~97% of messages have data packets arriving on time, those packets make up only ~0.4% of a message’s packets on average.
The number of data packets a message has does not correlate with its actual jitter period.