1 Goal


The goal of this task is to conduct an Exploratory Data Analysis (EDA) on a dataset that contains survey responses to Mobility Services.


2 Universal Libraries


These libraries are used throughout the code. Libraries specific to a single step are loaded together with that step's code.

library(ggplot2)    #library for Visualizations
library(dplyr)      #library for Data Manipulation
library(lubridate)  # library for managing time based data
library(stringr)    #Used to wrap text in plot labels

3 Data Import


Importing the dataset and having a look at the structure of the dataset to get an understanding of the Data.

mydata <-read.csv(file="Data_Analyst_Assigment_Dataset - 2018.csv", header=TRUE, sep=";")
summary(mydata)
##        Id             Day                City               Rating     
##  Min.   :   1.0   Length:1298        Length:1298        Min.   :1.000  
##  1st Qu.: 325.2   Class :character   Class :character   1st Qu.:3.000  
##  Median : 649.5   Mode  :character   Mode  :character   Median :3.000  
##  Mean   : 649.5                                         Mean   :3.327  
##  3rd Qu.: 973.8                                         3rd Qu.:4.000  
##  Max.   :1298.0                                         Max.   :5.000  
##     X             X.1            X.2         
##  Mode:logical   Mode:logical   Mode:logical  
##  NA's:1298      NA's:1298      NA's:1298     
##                                              
##                                              
##                                              
## 
str(mydata)
## 'data.frame':    1298 obs. of  7 variables:
##  $ Id    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Day   : chr  "07.12.2015" "07.12.2015" "07.12.2015" "07.12.2015" ...
##  $ City  : chr  "Prag (Busbahnhof ÚAN Florenc)" "Amsterdam Sloterdijk" "Prag Hbf" "Wien Stadion Center (Busbahnhof)" ...
##  $ Rating: int  3 5 4 2 5 4 4 4 3 5 ...
##  $ X     : logi  NA NA NA NA NA NA ...
##  $ X.1   : logi  NA NA NA NA NA NA ...
##  $ X.2   : logi  NA NA NA NA NA NA ...

Id : This is an integer. It would be interesting to see if there are any repeating values of Id.
Day : Day holds dates stored as character strings (dd.mm.yyyy) and needs to be converted to a date-time class.
City : City is a character column with 8 distinct values.
Rating : Rating is an integer from 1 to 5 but can be treated as a factor in this context.
X, X.1, X.2 : These columns contain only NA and need to be removed.
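
As a minimal sketch of the last point, columns that contain only NA can be detected programmatically instead of by position. The `toy` data frame and its values are made up for illustration and only mimic the shape of the survey data.

```r
# Toy data frame standing in for the survey data (values are made up)
toy <- data.frame(
  Id     = 1:3,
  Rating = c(3L, 5L, 4L),
  X      = c(NA, NA, NA),
  X.1    = c(NA, NA, NA)
)

# TRUE for every column that holds at least one non-missing value
has_data <- vapply(toy, function(col) !all(is.na(col)), logical(1))

# Keep only the columns that carry information
toy_clean <- toy[, has_data, drop = FALSE]
print(names(toy_clean))
```

Selecting by content rather than by column index keeps the step valid if the column order of the file changes.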


4 Data Wrangling / Feature Engineering


Assigning appropriate classes to the data

mydata %>% summarise(count_all = n_distinct(Id)) # Checking if Id has duplicates
##   count_all
## 1      1298
# All 1298 Id values are distinct, so there are no duplicates
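
A base R alternative for the same check, shown here as a sketch on made-up IDs, also reports which values repeat. The vector `ids` is hypothetical.

```r
# Made-up IDs containing one repeated value
ids <- c(1L, 2L, 3L, 3L, 5L)

# Position of the first duplicated element, or 0 if every value is unique
first_dup <- anyDuplicated(ids)

# The values that occur more than once
dup_values <- unique(ids[duplicated(ids)])

print(first_dup)
print(dup_values)
```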

Selecting the relevant columns into a new data frame.
Assigning the date a POSIXct class.
Adding an extra column for the day of the week.

mydata1 <- mydata[c("Day", "City", "Rating")] # select the relevant columns
mydata1$Day <- as.POSIXct(mydata1$Day, format = "%d.%m.%Y") # parse dd.mm.yyyy strings as POSIXct
mydata1$Weekday <- as.factor(weekdays(mydata1$Day)) # add an extra column for the day of the week
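
Because `weekdays()` returns names in the language of the current locale, the result can differ between machines. The sketch below is one possible locale-independent variant: it derives the weekday number from `as.POSIXlt()` and maps it to fixed English names with calendar-ordered levels. The three example dates and the `day_names` vector are assumptions for illustration.

```r
# Example dates in the same dd.mm.yyyy format as the Day column
days <- as.Date(c("07.12.2015", "08.12.2015", "12.12.2015"), format = "%d.%m.%Y")

# wday is 0 for Sunday up to 6 for Saturday, regardless of locale
day_num <- as.POSIXlt(days)$wday

# Fixed English names, indexed so that position 1 corresponds to wday 0
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Levels in calendar order (Monday first) so plots need no manual reordering
weekday <- factor(day_names[day_num + 1],
                  levels = c(day_names[2:7], day_names[1]))
print(weekday)
```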

Adding the Country name to a new column in the dataframe.

User-defined function for multiple string substitution

mgsub <- function(pattern, replacement, x, ...) {
  if (length(pattern)!=length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in seq_along(pattern)) {
    result <- gsub(pattern[i], replacement[i], result, ...)
  }
  result
}
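
A short usage sketch of the function above, on made-up station names. The function is repeated so the block runs on its own; the input vectors are illustrative only. Replacements are applied one after another, so a later pattern also sees the text produced by an earlier one.

```r
# Same helper as above, repeated so that this block is self-contained
mgsub <- function(pattern, replacement, x, ...) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in seq_along(pattern)) {
    result <- gsub(pattern[i], replacement[i], result, ...)
  }
  result
}

stations <- c("Prag Hbf", "Berlin ZOB", "Wien Hbf")

# Each pattern is replaced in turn; unmatched strings are left untouched
mapped <- mgsub(c("Prag Hbf", "Berlin ZOB"), c("Czech Republic", "Germany"), stations)
print(mapped)

# Extra arguments are passed on to gsub(); fixed = TRUE treats "(" literally
no_parens <- mgsub(c("(", ")"), c("", ""), "Wien (Busbahnhof)", fixed = TRUE)
print(no_parens)
```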

Replacing City name with Country in the new column

mydata1$Country <- mydata1$City  # copying cities into a new column
mydata1$Country <- gsub("\\(|\\)", "", mydata1$Country) # removing "(" and ")"
# The station names are used as regex patterns, so "ü" and "Ú" must be spelled out:
# a literal "?" acts as a regex quantifier and does not match these characters
mydata1$Country <- as.factor(mgsub(c("Berlin ZOB", "Amsterdam Sloterdijk",
                          "Bruxelles Gare du Midi", "Zürich HB Carpark Sihlquai",
                          "Prag Busbahnhof ÚAN Florenc", "Prag Hbf",
                          "Wien Stadion Center Busbahnhof",
                          "Paris Charles de Gaulle Terminal 3 - Busbahnhof "),
                          c("Germany", "Netherlands", "Belgium",
                            "Switzerland", "Czech Republic",
                            "Czech Republic", "Austria", "France"),
                             mydata1$Country))
str(mydata1)
## 'data.frame':    1298 obs. of  5 variables:
##  $ Day    : POSIXct, format: "2015-12-07" "2015-12-07" ...
##  $ City   : chr  "Prag (Busbahnhof ÚAN Florenc)" "Amsterdam Sloterdijk" "Prag Hbf" "Wien Stadion Center (Busbahnhof)" ...
##  $ Rating : int  3 5 4 2 5 4 4 4 3 5 ...
##  $ Weekday: Factor w/ 6 levels "Friday","Monday",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Country: Factor w/ 8 levels "Austria","Belgium",..: 7 6 3 1 5 2 6 2 5 2 ...
mydata2 <- mydata1 #saving a dataframe with Rating as numeric values
mydata1$Rating<- as.factor(mydata1$Rating) # Changing Rating from numeric to Factor
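
An alternative to pattern substitution, sketched here under the assumption that station names can be matched exactly, is a named lookup vector. Any station without an entry then shows up as NA instead of silently keeping its original name. The names `country_of` and `cities` and the "Unknown Stop" value are hypothetical.

```r
# Hypothetical lookup table: station name -> country
country_of <- c(
  "Berlin ZOB"           = "Germany",
  "Amsterdam Sloterdijk" = "Netherlands",
  "Prag Hbf"             = "Czech Republic"
)

# Made-up input, including one station that is missing from the table
cities <- c("Prag Hbf", "Berlin ZOB", "Amsterdam Sloterdijk", "Unknown Stop")

# Exact-match lookup by name; stations without an entry become NA
country <- unname(country_of[cities])

# List the stations that still need a mapping
unmapped <- unique(cities[is.na(country)])

print(country)
print(unmapped)
```

Checking `unmapped` after the lookup is a quick way to confirm that the number of Country levels matches the number of countries expected.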

5 Visualization 1 (City, Country, Day)


City/Country vs Rating : Does the quality of Flixbus’s service vary between cities and countries? If so, this should be visible in the ratings.

ggplot(data = mydata1, aes(x = City, fill = Rating)) + geom_bar() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  ggtitle("City_Vs_Ratings") +
  theme(plot.title = element_text(hjust = 0.5))

The most 5-star ratings come from passengers starting from Amsterdam and the fewest from Paris.
The ratings seem to be fairly homogeneously distributed, so it can be inferred that the quality of service is almost equal for all origins.
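
Since the bars show raw counts, cities with more responses naturally have taller segments. One way to compare the distributions directly, sketched here on made-up data, is to convert the counts to per-city shares with `prop.table()`. The `toy` data frame is an assumption for illustration.

```r
# Made-up ratings for two cities with different numbers of responses
toy <- data.frame(
  City   = c("A", "A", "A", "A", "B", "B"),
  Rating = c(3, 3, 4, 5, 3, 4)
)

# Counts of each rating per city
counts <- table(toy$City, toy$Rating)

# Share of each rating within a city (every row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

In ggplot2 the same view can be obtained with `geom_bar(position = "fill")`, which scales every bar to the same height.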

ggplot(data = mydata1, aes(x = Country, fill = Rating)) + geom_bar() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  ggtitle("Country_Vs_Ratings") + theme(plot.title = element_text(hjust = 0.5))

The same pattern is observed for countries as well, so ratings are almost equal from all origins.

Days vs Rating : Does the quality of Flixbus’s service vary from day to day? Are ratings higher on some days?

positions <- c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday")
ggplot(data = mydata1, aes(x = Weekday, fill = Rating)) + geom_bar() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10),limits = positions) +
  ggtitle("Days_Vs_Ratings") + theme(plot.title = element_text(hjust = 0.5))

Mondays seem to have the most 5-star ratings, which reflects that the number of travellers was highest on Monday.
Wednesday, Thursday and Friday seem to have a large number of 3-star ratings, which could be improved.


6 Visualization 2 (Average Ratings)


Average Ratings : Which days, cities or countries are the best rated?
We take the average rating for each city, country and day of the week.

# Average only the Rating column; the grouping variable is returned as Group.1
mydataCity <- aggregate(mydata2["Rating"], by = list(mydata2$City), FUN = mean)
mydataDate <- aggregate(mydata2["Rating"], by = list(mydata2$Weekday), FUN = mean)
mydataCountry <- aggregate(mydata2["Rating"], by = list(mydata2$Country), FUN = mean)
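
For comparison, the formula interface of `aggregate()` expresses the same group average and names the grouping column after the variable instead of `Group.1`. The sketch below uses made-up data and also counts the responses per group, since an average over few responses is less reliable. The `toy` data frame is hypothetical.

```r
# Made-up ratings for two cities
toy <- data.frame(
  City   = c("A", "A", "B", "B", "B"),
  Rating = c(3, 5, 2, 4, 3)
)

# Mean rating per city; the result has the columns City and Rating
avg <- aggregate(Rating ~ City, data = toy, FUN = mean)

# Number of responses behind each average
n <- aggregate(Rating ~ City, data = toy, FUN = length)

print(avg)
print(n)
```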

p_city <- ggplot(data = mydataCity, aes(x = Group.1, y = Rating)) +
  geom_point(color = "darkred") + xlab("City") +
  scale_x_discrete(labels = function(x) abbreviate(x, minlength = 7)) +
  ggtitle("Ratings Vs..") + theme(plot.title = element_text(hjust = 0.5))

p_day <- ggplot(data = mydataDate, aes(x = Group.1, y = Rating)) +
  geom_point(color = "darkred") + xlab("Days") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10), limits = positions)

p_country <- ggplot(data = mydataCountry, aes(x = Group.1, y = Rating)) +
  geom_point(color = "darkred") + xlab("Country")

library(gridExtra)  # library for arranging several plots on one page
grid.arrange(p_city, p_day, p_country)

The rating varies very little across cities and countries.
The lowest average rating is from Belgium and the highest is from the Czech Republic.
The highest average rating is seen on Monday; it generally decreases up to Thursday and then rises again on Saturday.
These are micro variations, as the average rating is between 3 and 3.6 for every parameter.


7 Applying Machine Learning - Random Forest


Selecting City, Country and Weekday as predictors for Rating.

library(caret)  #ML library
mydata3 <-  mydata1[c(2,3,4,5)]

Creating a 70% - 30% partition for the train and test sets using the caret library.

set.seed(345)
indexes <- createDataPartition(y=mydata3$Rating, times=1,p=0.7,list=FALSE) 
trainSet<- mydata3[indexes,]
testSet <- mydata3[-indexes,]
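
The idea behind a stratified split can be sketched in base R: sample the same share of rows within each rating class, then check that the class shares in the training part resemble those of the full data. This is an illustration only, not caret's implementation; `stratified_index` is a hypothetical helper and the ratings are randomly generated.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Made-up ratings with an uneven class distribution
rating <- factor(sample(1:5, 200, replace = TRUE,
                        prob = c(0.05, 0.15, 0.35, 0.35, 0.10)))

# Hypothetical helper: draw a share p of the row indices within each class
stratified_index <- function(y, p) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(i) {
    i[sample.int(length(i), ceiling(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(rating, p = 0.7)
train_share <- length(train_idx) / length(rating)
print(train_share)

# Class shares in the full data and in the training part should be close
print(round(prop.table(table(rating)), 2))
print(round(prop.table(table(rating[train_idx])), 2))
```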

Training a Random Forest algorithm.

set.seed(2334)
ctrl <- trainControl(method="repeatedcv",number=2,repeats = 2) 
start.time <- Sys.time()  #To note the time taken to train an algorithm
Rf_tune <- train(Rating~., data = trainSet, method= "rf",preProcess = c( "center","scale"),
                 trControl = ctrl, tuneLength = 8)
end.time <- Sys.time()
saveRDS(Rf_tune, file = "Rf_tune") # Saving the model 
time.taken <- end.time - start.time
print(time.taken)
## Time difference of 12.27972 secs
Rf_tune
## Random Forest 
## 
## 911 samples
##   3 predictor
##   5 classes: '1', '2', '3', '4', '5' 
## 
## Pre-processing: centered (19), scaled (19) 
## Resampling: Cross-Validated (2 fold, repeated 2 times) 
## Summary of sample sizes: 455, 456, 455, 456 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.4769339  0.1686284
##    4    0.4769339  0.1738712
##    6    0.4566235  0.1551943
##    9    0.4511338  0.1513551
##   11    0.4555246  0.1553749
##   14    0.4511350  0.1550139
##   16    0.4478396  0.1516606
##   19    0.4440042  0.1458485
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.

The selected model (mtry = 4) has a cross-validated accuracy of about 47.7% on the training set.

Testing

Rf_test <- predict(Rf_tune, newdata = testSet)
postResample(Rf_test, testSet$Rating)
##  Accuracy     Kappa 
## 0.4444444 0.1206405
confusionMatrix(data = Rf_test, testSet$Rating)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5
##          1  0  0  0  0  0
##          2  0  0  0  0  0
##          3  6 32 79 62  7
##          4 12 22 50 93 24
##          5  0  0  0  0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4444          
##                  95% CI : (0.3942, 0.4955)
##     No Information Rate : 0.4005          
##     P-Value [Acc > NIR] : 0.04402         
##                                           
##                   Kappa : 0.1206          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.00000   0.0000   0.6124   0.6000   0.0000
## Specificity           1.00000   1.0000   0.5853   0.5345   1.0000
## Pos Pred Value            NaN      NaN   0.4247   0.4627      NaN
## Neg Pred Value        0.95349   0.8605   0.7512   0.6667   0.9199
## Prevalence            0.04651   0.1395   0.3333   0.4005   0.0801
## Detection Rate        0.00000   0.0000   0.2041   0.2403   0.0000
## Detection Prevalence  0.00000   0.0000   0.4806   0.5194   0.0000
## Balanced Accuracy     0.50000   0.5000   0.5988   0.5672   0.5000

The model has a test accuracy of 44.4%, only slightly above the no-information rate of 40.1%, and it only ever predicts ratings 3 and 4.
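
The headline statistics can be recomputed by hand from the confusion matrix printed above, which makes it easier to see where they come from. The sketch below copies the counts from that output; the variable names are chosen for illustration.

```r
# Confusion matrix copied from the output above (rows: prediction, columns: reference)
cm <- matrix(
  c( 0,  0,  0,  0,  0,
     0,  0,  0,  0,  0,
     6, 32, 79, 62,  7,
    12, 22, 50, 93, 24,
     0,  0,  0,  0,  0),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = as.character(1:5), Reference = as.character(1:5))
)

n <- sum(cm)

# Accuracy: share of test cases on the diagonal
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the most frequent class
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the marginal totals alone would produce
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4))
```

The values agree with the reported Accuracy (0.4444), No Information Rate (0.4005) and Kappa (0.1206).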


8 Conclusions


Given more data and labels, a machine learning model could be trained to predict the rating that a customer would provide.
The overall dataset seems to be very homogeneous, so few patterns could be observed. More data could provide better insights.
The ratings are a very good KPI for service quality.