Homework assignment No. 2

Describe Your Data:

  1. Data types.
  2. Statistics (mean, min, max, etc., depending on the data types); use box plots and other similar plots to illustrate them.
  3. Create basic visualizations of your data.
  4. Check for periodicity in your data, show it (if there is no seasonality, show that there is no seasonality).

Data types

  1. CargoOrderID - an integer representing the ID given to the cargo;
  2. Value - a real (float) number representing the value of the cargo;
  3. SenderCity - a character string representing the city from which the cargo needs to be picked up;
  4. ReceiverCity - a character string representing the city to which the cargo will be delivered;
  5. DateCreated - a datetime string representing when the request to deliver the cargo was created;
  6. SenderX - a real (float) number representing the latitude, in decimal degrees, of the location from which the cargo needs to be picked up;
  7. SenderY - a real (float) number representing the longitude, in decimal degrees, of the location from which the cargo needs to be picked up;
  8. ReceiverX - a real (float) number representing the latitude, in decimal degrees, of the location to which the cargo needs to be delivered;
  9. ReceiverY - a real (float) number representing the longitude, in decimal degrees, of the location to which the cargo needs to be delivered;
  10. LDM - a real (float) number representing the loading meters of the cargo;
  11. Weight - a real (float) number representing the weight of the cargo;
  12. Volume - a real (float) number representing the volume of the cargo;
  13. FirstDimension - a real (float) value representing the first (length) dimension of the cargo;
  14. SecondDimension - a real (float) value representing the second (width) dimension of the cargo;
  15. UnitTypeName - a character string representing the type of the cargo in logistical terms;
  16. UnitTypeID - an integer representing the ID given to the cargo type;
  17. Terminal_s - a boolean (binary) value representing whether the cargo was redirected to a terminal in the country from which it is picked up;
  18. Terminal_r - a boolean (binary) value representing whether the cargo was redirected to a terminal in the country to which it is delivered.
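
The types listed above can be checked directly in R. The following is a minimal sketch on a toy data frame, not the real dataset: the values, the city names and the datetime format are assumptions made only for illustration.

```r
# Toy stand-in for the real dataset; every value here is made up.
toy <- data.frame(
  CargoOrderID = c(1L, 2L, 3L),
  Value        = c(35.5, 120.0, 80.25),
  SenderCity   = c("CityA", "CityB", "CityC"),
  DateCreated  = c("2021-11-08 09:15:00", "2021-11-09 14:02:00",
                   "2021-11-10 11:40:00"),
  Terminal_s   = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Parse the datetime string into a date-time class (format is an assumption).
toy$DateCreated <- as.POSIXct(toy$DateCreated,
                              format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# First class of every column, as a named character vector.
col_types <- vapply(toy, function(col) class(col)[1], character(1))
print(col_types)
```

On the real data, `str(data)` gives a similar overview in a single call.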

Visualizations

ggplot(data, aes(x=Value)) + 
  geom_histogram(color="darkblue", fill="lightblue", bins = 26) +
  labs(title = "Distribution of the values of cargo", x = "Value", y = "Count") + 
  theme_light()

The graph shows that most of the cargo is valued under 100 euros and the rest under 500 euros, with several exceptions of up to 2500 euros per cargo.
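
The reading of the histogram can be backed up with a few summary statistics. This is a minimal sketch on a made-up vector of cargo values (the numbers are hypothetical, not taken from the dataset); on the real data the vector would be `data$Value`.

```r
# Made-up cargo values in euros, for illustration only.
value <- c(20, 35, 60, 80, 95, 150, 320, 480, 2500)

# Basic descriptive statistics.
stats <- c(min    = min(value),
           median = median(value),
           mean   = mean(value),
           max    = max(value))
print(stats)

# Share of cargo valued under 100 and under 500 euros.
share_under_100 <- mean(value < 100)
share_under_500 <- mean(value < 500)
print(c(under_100 = share_under_100, under_500 = share_under_500))
```

With a long right tail like this one, the mean sits well above the median, so the median is probably the more representative "typical" value.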

par(mfrow = c(1,3))
boxplot(data$LDM, main = "Distribution of LDM\nof the cargo", ylab = "LDM")
boxplot(data$Weight, main = "Distribution of Weight\nof the cargo", ylab = "Weight")
boxplot(data$Volume, main = "Distribution of Volume\nof the cargo", ylab = "Volume")

The tendency of the previous graph is visible here as well: since most of the cargo is of low value, its LDM, weight and volume are low too.

ggplot(data, aes(x=UnitTypeID, y=datecount, fill=UnitTypeID)) +
  geom_bar(stat="identity") + 
  labs(title = "Number of different types of cargo",
       x = "Unit type ID", y = "Count") + 
  theme_light()

The most common cargo type is EP(120x80x220), which corresponds to unit type ID 33, with 25 cargo. The runner-up is the VNT type (unit type ID 47) with 17 cargo.

ggplot(data, aes(x=Terminal_s, y=datecount, fill = Terminal_s)) +
  geom_bar(stat="identity") + 
  labs(title = "Number of cargo that was redirected to a terminal in the sender country",
       x = "", y = "Count") + 
  theme_light()

From this graph we can see that 45 cargo were not redirected to a terminal, leaving 21 (of 66 in total) that were redirected to a terminal in the country from which they were sent.
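
The counts read off the bar chart can also be obtained as a table. The sketch below rebuilds a logical vector from the counts quoted above (45 not redirected, 21 redirected); on the real data the vector would be `data$Terminal_s`.

```r
# Logical vector reconstructed from the counts quoted in the text.
terminal_s <- rep(c(FALSE, TRUE), times = c(45, 21))

# Absolute counts per category.
counts <- table(terminal_s)
print(counts)

# Percentage share of each category, rounded to one decimal place.
shares <- round(100 * prop.table(counts), 1)
print(shares)
```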

ggplot(data, aes(x=Terminal_r, y=datecount, fill = Terminal_r)) +
  geom_bar(stat="identity") + 
  labs(title = "Number of cargo that was redirected to a terminal in the receiver country",
       x = "", y = "Count") + 
  theme_light()

From this graph we can see that 50 cargo were not redirected to a terminal, leaving 16 that were redirected to a terminal in the country to which they were sent.

par(mfrow = c(1,2))
cdata <- data[-which(is.na(as.numeric(data$FirstDimension))),]
## Warning in which(is.na(as.numeric(data$FirstDimension))): NAs introduced by
## coercion
cdata$FirstDimension <- as.numeric(cdata$FirstDimension)
cdata$SecondDimension <- as.numeric(cdata$SecondDimension)
boxplot(cdata$FirstDimension, main = "Distribution of the\nFirstDimension of the cargo", ylab = "FirstDimension")
boxplot(cdata$SecondDimension, main = "Distribution of the\nSecondDimension of the cargo", ylab = "SecondDimension")

Here we can see that more than half of the cargo had the same values for both the first and second dimensions, which produces the compressed box plots shown in the graphs.
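
One caveat about filtering with `-which(...)`: if no value fails to parse, `which()` returns an empty vector and the negative index drops every row. A logical mask avoids this. The sketch below uses made-up character vectors, not the real `FirstDimension` column.

```r
# Made-up raw dimension values, some of which are not numeric.
raw <- c("120", "80", "n/a", "120", "")

# Coerce once, silencing the expected "NAs introduced by coercion" warning.
parsed <- suppressWarnings(as.numeric(raw))

# A logical mask keeps exactly the values that parsed successfully.
clean <- parsed[!is.na(parsed)]
print(clean)

# The pitfall: with nothing to drop, -which(...) selects nothing at all.
all_good    <- c("1", "2")
via_which   <- all_good[-which(is.na(as.numeric(all_good)))]
via_logical <- all_good[!is.na(as.numeric(all_good))]
print(length(via_which))    # zero elements survive
print(length(via_logical))  # both elements survive
```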

Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
ggplot(data = data[-1,], aes(x = date, y = datecount)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Number of cargo requests per day",
       x = "Date", y = "Count")

In this graph, the 8th, 15th, 22nd and 29th of November are Mondays. We can see that no cargo orders were made on weekends. A weekly cycle is also visible: the number of orders grows from Monday to Thursday, drops to a low number on Friday, and falls to zero on weekends.

It should be noted that this dataset contains too few observations and spans too short a period for a proper seasonality evaluation.
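
A weekly cycle like the one described can also be checked numerically. The sketch below uses synthetic daily counts that only mimic the described shape (growth from Monday to Thursday, a low Friday, zero on weekends); they are not the real data, and the year in the dates is an assumption. It computes the mean count per weekday and the sample autocorrelation at lag 7.

```r
# Synthetic daily order counts for four full weeks (NOT the real data).
dates  <- seq(as.Date("2021-11-08"), by = "day", length.out = 28)  # year assumed
counts <- rep(c(2, 3, 4, 5, 1, 0, 0), times = 4)

# Mean number of orders per weekday; POSIXlt$wday is 0 = Sunday ... 6 = Saturday.
wday    <- as.POSIXlt(dates)$wday
by_wday <- tapply(counts, wday, mean)
print(by_wday)

# Sample autocorrelation; a high value at lag 7 points to a weekly cycle.
ac   <- acf(counts, lag.max = 7, plot = FALSE)
lag7 <- ac$acf[8]  # element 1 is lag 0, so element 8 is lag 7
print(lag7)
```

With only about four weeks of observations, such an estimate is rough and should be read as an indication rather than proof of seasonality.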

At the lecturer's request, the same data is also visualized in another way, with one line per week.

ggplot(df, aes(x=day, y=number, group=week, color=week)) +
  geom_line(size = 2)+theme_light() +
  labs(title = "Number of cargo requests per day",
       x = "", y = "Number of cargo")
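
The construction of `df` is not shown above. One possible way to build a data frame with the `day`, `number` and `week` columns from daily counts is sketched below. The input values are synthetic, the year is an assumption, and the week numbering assumes that the series starts on a Monday.

```r
# Synthetic daily counts for two weeks starting on a Monday (NOT the real data).
daily <- data.frame(
  date      = seq(as.Date("2021-11-08"), by = "day", length.out = 14),
  datecount = rep(c(2, 3, 4, 5, 1, 0, 0), times = 2)
)

# Weekday index with Monday = 1 ... Sunday = 7 (POSIXlt$wday has Sunday = 0).
day_names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
iso_wday  <- (as.POSIXlt(daily$date)$wday + 6) %% 7 + 1

# Week counter relative to the first date in the series.
week_no <- as.integer(daily$date - min(daily$date)) %/% 7 + 1

df <- data.frame(
  day    = factor(day_names[iso_wday], levels = day_names),  # ordered Mon..Sun
  number = daily$datecount,
  week   = factor(week_no)  # a factor gives one discrete colour per week
)
print(head(df, 7))
```

Keeping `day` as a factor with explicit levels places the weekdays on the x axis in calendar order rather than alphabetically.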