Jitter
Introduction and Basics
Introduction
The goal of this project was to gain insight into the jitter present in each message in a log, in order to find out how often a message is sent on time relative to its expected period. Jitter is defined as “the change in time it takes for a data packet to travel across a network (Dialpad).” For example, a message can be made up of 4 data packets. Ideally, the 4 data packets arrive at their destination at the same time, but this does not always happen due to network disruptions. Instead, they end up arriving at irregular intervals, and this variation is the jitter. Moreover, the jitter period is the maximum deviation of any of a message’s data packets from their mean arrival time (NXP).
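To make that definition concrete, below is a minimal sketch, using made-up arrival times rather than values from the data, of how the jitter period of a single message would be computed under this definition:
#illustrative example: jitter period of one message with made-up packet arrival times (ms)
arrivals = [10.0, 12.5, 11.0, 18.0]
mean_arrival = sum(arrivals) / len(arrivals)            # 12.875 ms
deviations = [abs(t - mean_arrival) for t in arrivals]  # [2.875, 0.375, 1.875, 5.125]
jitter_period = max(deviations)                         # 5.125 ms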
The two datasets used in this project were:
- logdata.csv: a csv file with two columns, message_id and timestamp (ms). The column “message_id” is the unique identifier for a message, and the column “timestamp (ms)” records the time in milliseconds at which that observation arrived at its destination.
- periods.csv: a csv file with two columns, message_id and period. The column “message_id” is the unique identifier for a message, and the column “period” is the expected jitter period, measured in milliseconds (ms).
Loading data and packages
#r packages
library(ggplot2)
library(reticulate)
#python packages
import pandas as pd
#loading data
log_data = pd.read_csv("logdata.csv")
periods_data = pd.read_csv("periods.csv")
Understanding the data
#periods data
#checking dimensions
periods_data.shape
(100, 2)
#checking number of unique message_id
len(pd.unique(periods_data['message_id']))
100
#sorting by period
periods_data = periods_data.sort_values(by = 'period')
periods_data.head()
    message_id  period
94 1426 0
0 49 10
63 880 10
67 908 10
68 909 10
The dataset had 100 rows and 2 columns, and the number of distinct message_id values was 100. This means that message_id does not repeat in the dataset. There is no information about the unit of period, so it was assumed to be ms (milliseconds), since the timestamps in the “log_data” dataset are measured in ms.
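That uniqueness claim can also be verified directly; this one-line check is an illustrative addition, not part of the original output:
#sanity check: every message_id appears exactly once in periods_data
periods_data['message_id'].is_unique  # True, since all 100 ids are distinct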
#log_data
#checking dimensions
log_data.shape
(350243, 2)
#checking number of unique message_id
len(pd.unique(log_data['message_id']))
102
#sorting by timestamp
log_data = log_data.sort_values(by = 'timestamp (ms)')
log_data.head()
   timestamp (ms)  message_id
0 0.243 341
1 0.329 356
2 0.585 389
3 0.841 390
4 0.969 907
The dataset had 350,243 rows and 2 columns, and the number of distinct message_id values was 102. This means that message_id values repeat many times in the dataset. The timestamps are in ms (milliseconds). Because message_ids repeat, it was assumed that the setup of the study was that each occurrence of a given message_id represents one of the individual data packets that make up that message. Sorting the dataset by timestamp showed that the entries are valid, since they begin at 0.243 ms.
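Under that assumption, the packet count of each message is simply the number of times its message_id repeats. The snippet below is an illustrative preview of those counts; the full per-message counts are computed in the EDA:
#preview of packets per message: each repetition of a message_id is one data packet
log_data['message_id'].value_counts().head()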
EDA
Periods data
The histogram below shows that the periods range from 0 to 1,000 ms, with the most common period being 100 ms.
#histogram to visualize distribution of period
#python objects are accessed from R via reticulate's py$
ggplot(py$periods_data, aes(x = period)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(breaks = seq(0, 1000, 100)) +
  labs(x = "period (ms)", title = "Histogram of the jitter period expected for a message arrival") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
Log data
In the original dataset, the mean arrival time across all packets of all messages was 52,570.47 milliseconds and the median was 52,569.51. Moreover, two sub-datasets were created in which the mean and the median arrival time across all data packets of a given message were calculated. The averages of those per-message statistics were very similar to the overall figures, at 53,059.25 and 53,058.25 milliseconds respectively.
#original data
log_data.describe()['timestamp (ms)']
count 350243.000000
mean 52570.470550
std 30350.411599
min 0.243000
25% 26287.875000
50% 52569.507000
75% 78854.328000
max 105136.360000
Name: timestamp (ms), dtype: float64
#log_data grouped by message_id: mean
log_data_mean = log_data.groupby(['message_id']).mean()
#log_data grouped by message_id: median
log_data_median = log_data.groupby(['message_id']).median()
#merging mean and median
log_data_merged = pd.merge(log_data_mean, log_data_median, on="message_id")
log_data_merged.reset_index(inplace=True)
log_data_merged.columns = ['message_id','mean','median']
log_data_merged[['mean','median']].describe()
               mean        median
count 102.000000 102.000000
mean 53059.249968 53058.246819
std 4980.251510 4980.502892
min 52314.690314 52056.448000
25% 52563.454776 52562.914750
50% 52569.366662 52569.592250
75% 52584.622755 52587.448000
max 102855.610250 102855.599500
#number of data packets per message
log_data_packets = log_data.groupby(['message_id']).count()
log_data_packets.reset_index(inplace=True)
log_data_packets.columns = ['message_id','number_packets']
Moreover, as shown in the plot below, most messages are made up of fewer than 2,500 data packets, with the highest packet count being just over 10,500 and the lowest being 4.
ggplot(py$log_data_packets, aes(x = number_packets)) +
  geom_density(fill = "#192841", alpha = 0.5) +
  theme_minimal() +
  labs(x = "number of data packets",
       title = "Density of the number of data packets per message") +
  theme(plot.title = element_text(face = "bold"))
The dataset “log_data_general_metrics” was created to bring together, for each message, the mean and median time it takes for its data packets to arrive at their destination, its number of packets, and its expected jitter period.
#putting general metrics about a particular message together
log_data_metrics_packets = log_data_merged.merge(log_data_packets,
on='message_id',
how='left')
log_data_general_metrics = log_data_metrics_packets.merge(periods_data,
on = "message_id",
how = "left")
log_data_general_metrics.head()
   message_id          mean      median  number_packets  period
0 49 52567.447523 52567.6510 10514 10.0
1 81 52568.711934 52568.7600 10514 10.0
2 257 52570.002043 52570.3815 10514 10.0
3 258 52576.545720 52576.7810 1052 100.0
4 259 52534.996414 52535.5020 1051 100.0
Three columns were added to the dataset “log_period_general” to hold:
- difference: the difference between a data packet’s arrival time and the mean arrival time of all data packets of its message. Its absolute value is the actual period, but the raw signed difference was kept so that the number of data packets arriving earlier or later than their mean could be analyzed later.
- actual_period: how much earlier or later than the mean arrival time of its message’s data packets a specific data packet arrived. It is important to highlight why actual_period is expressed as an absolute value: contrary to what one might believe at first, a data packet arriving much earlier than the mean is no better than one arriving much later, since both create significant gaps between packet arrival times, and those gaps are what disrupt communication.
- off: the difference between the actual period and the expected period of the data packet. An off value of zero or less is a good sign, since it means the data packet arrived within the expected period (e.g. if a packet’s actual period is 5 and its expected period is 10, it arrived within expectations and its off value is -5).
#merging log_data_general_metrics with the original log_data
log_period_general = log_data.merge(log_data_general_metrics,
on = "message_id",
how = "left").sort_values(by = 'message_id')
#reordering columns
log_period_general = log_period_general.reindex(columns=['message_id','timestamp (ms)', 'mean',
'median','number_packets', 'period'])
#create column to hold the difference of the arrival of each data packet relative to the mean
#a positive difference means that the packet arrived later than the mean of its message's data packets
log_period_general['difference'] = log_period_general['timestamp (ms)'] - log_period_general['mean']
#create column to hold the actual jitter period of each data packet
log_period_general['actual_period'] = abs(log_period_general['difference'])
#create column to hold how off the actual period is from the expected period
log_period_general['off'] = log_period_general['actual_period'] - log_period_general['period']
#overview of dataset
log_period_general.head()
        message_id  timestamp (ms)  ...  actual_period            off
219511 49 65892.960 ... 13325.512477 13315.512477
117229 49 35192.227 ... 17375.220523 17365.220523
243916 49 73222.964 ... 20655.516477 20645.516477
182320 49 54732.758 ... 2165.310477 2155.310477
336231 49 100933.020 ... 48365.572477 48355.572477
[5 rows x 9 columns]
There were 207 data packets that were delivered on time. Given that there were 350,243 data packets in total across all messages, only ~0.06% of data packets arrived within the expected jitter period. Those data packets belong to 99 of the 102 messages, meaning that ~97% of the messages in the log had at least one of their data packets arrive on time.
#subset of observations within expectation
within_expectation = log_period_general[log_period_general['off'] <= 0]
#number of observations within expectation
within_expectation.shape[0]
207
#number of unique messages in the subset of observations within expectation
len(within_expectation['message_id'].unique())
99
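As a sanity check on those figures, both shares can be computed directly from the objects above (a minimal sketch: 207/350,243 ≈ 0.0006 and 99/102 ≈ 0.97):
#share of data packets arriving within the expected period
within_expectation.shape[0] / log_period_general.shape[0]  # ~0.0006, i.e. ~0.06%
#share of messages with at least one on-time packet
len(within_expectation['message_id'].unique()) / log_period_general['message_id'].nunique()  # 99/102, ~97%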
A dataset containing only those messages that had at least one packet arrive within the expected jitter period was created, with two columns: the number of packets that arrived on time and the total number of packets that made up the message. As the table below shows, among messages with at least one on-time packet, on average only ~0.4% of a message’s packets arrived on time, with the median at ~0.2%. In the best case, ~2.9% of a message’s packets arrived within the period.
#getting the message_ids that had at least one packet on time
within_expectation_id = within_expectation.iloc[:,0:1]
#counting how many packets of a given message arrived on time
within_expectation_packets = within_expectation_id.groupby('message_id').size().reset_index()
within_expectation_packets.columns = ['message_id', 'packets_within_period']
#merging it to get the column of their total number of packets
within_expectation_packets = within_expectation_packets.merge(log_data_packets,
on = 'message_id',
how = 'left')
#getting the ratio of number of packets on time and total number of packets
within_expectation_packets['ratio_packets'] = within_expectation_packets['packets_within_period']/within_expectation_packets['number_packets']
#retrieving metrics
within_expectation_packets.describe()
       message_id  packets_within_period  number_packets  ratio_packets
count 99.000000 99.000000 99.000000 99.000000
mean 725.232323 2.090909 3325.363636 0.004211
std 384.387714 0.656069 3769.657941 0.006678
min 49.000000 1.000000 105.000000 0.000190
25% 389.500000 2.000000 1051.000000 0.000380
50% 680.000000 2.000000 1052.000000 0.001901
75% 924.000000 2.000000 5257.000000 0.001903
max 1869.000000 8.000000 10515.000000 0.028571
Given those results, one question was whether there is any relationship between the actual jitter period and the expected jitter period. The boxplot below shows that there is not. The only exception is when the expected jitter period was 0: in that case the actual period was low as well, since all of the message’s data packets arrived at almost the same time (early and late arrivals canceled out around the mean).
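That exception can also be checked numerically before plotting; the snippet below is an illustrative check, not part of the original output:
#actual periods of the packets belonging to the message(s) with expected period 0
log_period_general[log_period_general['period'] == 0]['actual_period'].describe()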
#period treated as a factor so each expected period gets its own box
ggplot(py$log_period_general, aes(x = factor(period),
                                  y = actual_period)) +
  geom_boxplot(fill = "#192841", alpha = 0.5) +
  theme_minimal() +
  labs(x = 'expected period (ms)',
       y = 'actual period (ms)',
       title = 'Actual jitter period vs expected jitter period') +
  theme(plot.title = element_text(face = "bold"))
Finally, given these striking results, it was worth asking whether there is any correlation between the number of data packets in a message and its average actual period. This query was relevant because, if such a correlation existed, it would suggest that the statement “the more data packets a message has, the higher its jitter period” might be true. If that were proven, a proposed solution could be to allocate a larger expected jitter period to messages with more data packets, so that the gap between the expected and actual jitter period shrinks.
To test this, the average actual jitter period per message was calculated, and then its correlation with the message’s number of data packets was computed. As shown below, the correlation coefficient did not even reach 0.1, so the hypothesis was discarded.
#extracting relevant columns
log_period_general_actual = log_period_general[['message_id', 'actual_period']]
#getting the mean of the actual jitter period for a message
log_period_general_actual = log_period_general_actual.groupby(['message_id']).mean()
log_period_general_actual.reset_index(inplace=True)
#merging it with the total number of packets
log_period_general_actual = log_period_general_actual.merge(log_data_packets, on = 'message_id', how = 'left')
#computing the correlation
pearson_correlation = log_period_general_actual['actual_period'].corr(log_period_general_actual['number_packets'])
pearson_correlation
0.09057550188040804
Conclusions
The study reached the following conclusions:
Only ~0.06% of data packets arrived within the expected jitter period.
~97% of messages in the log (99 of 102) had at least one of their data packets arrive on time.
The maximum percentage of a message’s data packets that arrived on time was ~2.9%; the mean was ~0.4%. This means that even though ~97% of messages have data packets arriving on time, those packets make up only ~0.4% of a message’s packets on average.
The number of data packets a message has does not correlate with its actual jitter period.