Part 1 - Introduction:

Introduction: Cruise ships are an increasingly important part of tourism. Having amenities such as food and entertainment all in one place is convenient for older travelers and appealing to families with young kids, such as mine. For this reason, the cruise data immediately caught my interest when I was searching online for data sets to analyze.

Research Question: Based on the available data, which ship characteristics make a good model for predicting the number of crew members needed on a cruise ship?

Part 2 - Data:

Data Collection: The data set was obtained in text format from the University of Florida Statistics website. According to that website, the data was collected from www.truecruse.com.

Cases: There are 158 observations. Each observation is one cruise ship, and the columns are variables describing its characteristics.

Variables

variable     description
ship         name of the ship
line         cruise line operating the ship
age          age of the ship in years, as of 2013
tonnage      tonnage of the ship (1000s of tons)
passengers   passengers on board (100s)
length       length of the ship (100s of feet)
cabins       number of cabins (100s)
passden      passenger density
crew         number of crew members (100s)

We will look at the explanatory variables age, tonnage, passengers, length, cabins and passden, and how they relate to the number of crew on board.

Type of Study: This is an observational study because the data was recorded as it occurred. No intervention was applied to the cases or variables to influence the results.

Scope of Inference - generalizability: The populations of interest are cruise goers, cruise ships and their crews. I believe there is enough data to generalize the findings to most cruise ships and their crew sizes.

Scope of inference - causality: Since this is an observational study, we cannot establish a cause-and-effect relationship between the variables of interest.

Part 3 - Exploratory data analysis:

Load Data

# Load packages we will use later

library(knitr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.2.5

Let us load the Cruise Ship data

cruise <- read.csv("cruise_ships.csv")
kable(head(cruise))
ship line age tonnage passengers length cabins passden crew
Journey Azamara 6 30.277 6.94 5.94 3.55 42.64 3.55
Quest Azamara 6 30.277 6.94 5.94 3.55 42.64 3.55
Celebration Carnival 26 47.262 14.86 7.22 7.43 31.80 6.70
Conquest Carnival 11 110.000 29.74 9.53 14.88 36.99 19.10
Destiny Carnival 17 101.353 26.42 8.92 13.21 38.36 10.00
Ecstasy Carnival 22 70.367 20.52 8.55 10.20 34.29 9.20

Summarize Data

# examine column data types
str(cruise)
## 'data.frame':    158 obs. of  9 variables:
##  $ ship      : Factor w/ 138 levels "Adventure","Allegra",..: 53 92 12 16 21 24 25 34 35 37 ...
##  $ line      : Factor w/ 20 levels "Azamara","Carnival",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ age       : int  6 6 26 11 17 22 15 23 19 6 ...
##  $ tonnage   : num  30.3 30.3 47.3 110 101.4 ...
##  $ passengers: num  6.94 6.94 14.86 29.74 26.42 ...
##  $ length    : num  5.94 5.94 7.22 9.53 8.92 8.55 8.55 8.55 8.55 9.51 ...
##  $ cabins    : num  3.55 3.55 7.43 14.88 13.21 ...
##  $ passden   : num  42.6 42.6 31.8 37 38.4 ...
##  $ crew      : num  3.55 3.55 6.7 19.1 10 9.2 9.2 9.2 9.2 11.5 ...
# summarise the data
summary(cruise)
##       ship                   line         age           tonnage       
##  Spirit :  4   Royal Caribbean :23   Min.   : 4.00   Min.   :  2.329  
##  Legend :  3   Carnival        :22   1st Qu.:10.00   1st Qu.: 46.013  
##  Star   :  3   Princess        :17   Median :14.00   Median : 71.899  
##  Crown  :  2   Holland American:14   Mean   :15.69   Mean   : 71.285  
##  Dawn   :  2   Norwegian       :13   3rd Qu.:20.00   3rd Qu.: 90.772  
##  Freedom:  2   Costa           :11   Max.   :48.00   Max.   :220.000  
##  (Other):142   (Other)         :58                                    
##    passengers        length           cabins          passden     
##  Min.   : 0.66   Min.   : 2.790   Min.   : 0.330   Min.   :17.70  
##  1st Qu.:12.54   1st Qu.: 7.100   1st Qu.: 6.133   1st Qu.:34.57  
##  Median :19.50   Median : 8.555   Median : 9.570   Median :39.09  
##  Mean   :18.46   Mean   : 8.131   Mean   : 8.830   Mean   :39.90  
##  3rd Qu.:24.84   3rd Qu.: 9.510   3rd Qu.:10.885   3rd Qu.:44.19  
##  Max.   :54.00   Max.   :11.820   Max.   :27.000   Max.   :71.43  
##                                                                   
##       crew       
##  Min.   : 0.590  
##  1st Qu.: 5.480  
##  Median : 8.150  
##  Mean   : 7.794  
##  3rd Qu.: 9.990  
##  Max.   :21.000  
## 

As noted above, we will use age, tonnage, passengers, length, cabins and passden as predictor variables to fit a multiple linear regression model. Our response variable will be crew.

Matrix Scatterplot

Create a scatterplot matrix of the seven numeric variables (the six predictors and crew).

plot(cruise[3:9],col='blue', main="Scatterplot Matrix")

From the scatterplot matrix we can see a strong positive linear relationship between crew and each of cabins, length, passengers and tonnage. The age variable has a moderately weak relationship with crew. Lastly, passenger density shows a weak positive relationship with the number of crew on board.

Correlation plot

modeldata = cor(cruise[c(4:7,9)])
corrplot(modeldata, method = "number")

The correlation between the number of cabins on a ship and the number of crew is the highest, at 0.95. This suggests that the number of cabins is strongly and positively related to the number of crew on board.
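As a rough illustration of reading the correlation matrix, the sketch below ranks predictors by their correlation with crew. It uses only the six rows printed by head(cruise) earlier, so the numbers will not match the full-data values such as 0.95; the helper name rank_by_correlation is hypothetical and not part of the original analysis.

```r
# First six rows, copied from the head(cruise) output shown earlier.
cruise_head <- data.frame(
  tonnage    = c(30.277, 30.277, 47.262, 110.000, 101.353, 70.367),
  passengers = c(6.94, 6.94, 14.86, 29.74, 26.42, 20.52),
  length     = c(5.94, 5.94, 7.22, 9.53, 8.92, 8.55),
  cabins     = c(3.55, 3.55, 7.43, 14.88, 13.21, 10.20),
  crew       = c(3.55, 3.55, 6.70, 19.10, 10.00, 9.20)
)

# Hypothetical helper: correlation of every other column with the response,
# sorted from strongest to weakest by absolute value.
rank_by_correlation <- function(data, response) {
  predictors <- setdiff(names(data), response)
  r <- sapply(predictors, function(p) cor(data[[p]], data[[response]]))
  r[order(abs(r), decreasing = TRUE)]
}

crew_cor <- rank_by_correlation(cruise_head, "crew")
print(round(crew_cor, 2))
```

With the full data set, the same call would be rank_by_correlation(cruise[c(4:7, 9)], "crew").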

Fit LRM

Let us start fitting a linear regression on this data set using backward selection.

Model 1

model1 = lm(crew ~ age + tonnage + passengers + length + cabins + passden, data=cruise)
summary(model1)
## 
## Call:
## lm(formula = crew ~ age + tonnage + passengers + length + cabins + 
##     passden, data = cruise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7700 -0.4881 -0.0938  0.4454  7.0077 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5213400  1.0570350  -0.493  0.62258    
## age         -0.0125449  0.0141975  -0.884  0.37832    
## tonnage      0.0132410  0.0118928   1.113  0.26732    
## passengers  -0.1497640  0.0475886  -3.147  0.00199 ** 
## length       0.4034785  0.1144548   3.525  0.00056 ***
## cabins       0.8016337  0.0892227   8.985 9.84e-16 ***
## passden     -0.0006577  0.0158098  -0.042  0.96687    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9819 on 151 degrees of freedom
## Multiple R-squared:  0.9245, Adjusted R-squared:  0.9215 
## F-statistic:   308 on 6 and 151 DF,  p-value: < 2.2e-16

The coefficients of several variables are not statistically different from zero: age, tonnage and passden. We will drop passden since it has the largest p-value.

Let us re-fit the model

Model 2

model2 = lm(crew ~ age + tonnage + passengers + length + cabins, data=cruise)
summary(model2)
## 
## Call:
## lm(formula = crew ~ age + tonnage + passengers + length + cabins, 
##     data = cruise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7702 -0.4885 -0.0896  0.4447  7.0066 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.555922   0.650835  -0.854 0.394357    
## age         -0.012354   0.013395  -0.922 0.357834    
## tonnage      0.012915   0.008922   1.448 0.149784    
## passengers  -0.148627   0.038837  -3.827 0.000189 ***
## length       0.403835   0.113758   3.550 0.000513 ***
## cabins       0.802165   0.088013   9.114 4.37e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9786 on 152 degrees of freedom
## Multiple R-squared:  0.9245, Adjusted R-squared:  0.922 
## F-statistic:   372 on 5 and 152 DF,  p-value: < 2.2e-16

Now the age variable has the largest p-value. We will remove it and re-fit the model again.

Model 3

model3 = lm(crew ~ tonnage + passengers + length + cabins, data=cruise)
summary(model3)
## 
## Call:
## lm(formula = crew ~ tonnage + passengers + length + cabins, data = cruise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7593 -0.4639 -0.0716  0.4698  7.0239 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.818710   0.584866  -1.400 0.163590    
## tonnage      0.016319   0.008119   2.010 0.046185 *  
## passengers  -0.149851   0.038795  -3.863 0.000165 ***
## length       0.397554   0.113499   3.503 0.000604 ***
## cabins       0.790837   0.087109   9.079 5.18e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9782 on 153 degrees of freedom
## Multiple R-squared:  0.924,  Adjusted R-squared:  0.9221 
## F-statistic: 465.3 on 4 and 153 DF,  p-value: < 2.2e-16

In the final model, every remaining coefficient has a p-value below 0.05, so there is evidence that each one is different from zero. There are no variables remaining that can be eliminated from the model.
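The manual steps from Model 1 to Model 3 could be automated along the following lines. This is only a sketch of backward elimination by p-value: the helper backward_by_pvalue is hypothetical, it assumes numeric predictors, and it is demonstrated on simulated data (the x1, x2, noise and y columns are invented) rather than on the cruise data.

```r
# Hypothetical helper: backward elimination by p-value. Refits the model after
# dropping the predictor with the largest p-value, until every remaining
# predictor is below alpha (or only one predictor is left).
backward_by_pvalue <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    pvals <- coefs[-1, "Pr(>|t|)"]          # skip the intercept row
    names(pvals) <- rownames(coefs)[-1]
    if (max(pvals) < alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
}

# Simulated data: y depends on x1 and x2 only; "noise" is unrelated to y.
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 2 * sim$x1 - 3 * sim$x2 + rnorm(n)

final <- backward_by_pvalue(sim, "y", c("x1", "x2", "noise"))
print(names(coef(final)))
```

On the cruise data, the equivalent call would be backward_by_pvalue(cruise, "crew", c("age", "tonnage", "passengers", "length", "cabins", "passden")).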

Our multiple regression model is

\(\widehat{crew} = -0.818710 + 0.016319 \times tonnage - 0.149851 \times passengers + 0.397554 \times length + 0.790837 \times cabins\)
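To make the equation concrete, the sketch below plugs the first ship from head(cruise) into it. The coefficients are copied from the summary(model3) output; the helper predict_crew is a hypothetical name used only for illustration.

```r
# Coefficients copied from the summary(model3) output above.
model3_coef <- c(intercept = -0.818710, tonnage = 0.016319,
                 passengers = -0.149851, length = 0.397554, cabins = 0.790837)

# Hypothetical helper: predicted crew (in 100s) for given ship characteristics.
predict_crew <- function(tonnage, passengers, length, cabins, b = model3_coef) {
  unname(b["intercept"] + b["tonnage"] * tonnage +
           b["passengers"] * passengers +
           b["length"] * length + b["cabins"] * cabins)
}

# First ship in head(cruise): Journey (Azamara), observed crew = 3.55.
journey_hat <- predict_crew(tonnage = 30.277, passengers = 6.94,
                            length = 5.94, cabins = 3.55)
print(round(journey_hat, 2))          # about 3.80, i.e. roughly 380 crew
print(round(3.55 - journey_hat, 2))   # residual: observed minus predicted
```

With the fitted model object available, predict(model3, newdata = ...) would give the same value up to rounding of the coefficients.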

The adjusted R-squared value is 0.9221, which means that this model explains about 92.2% of the variation in the number of crew on a ship.
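The summary(model3) output reports a multiple R-squared of 0.924 and an adjusted R-squared of 0.9221. As a quick check, the adjusted value can be approximately recovered from the multiple R-squared, the sample size and the number of predictors. The inputs below are the rounded figures from the output, so the result agrees only up to rounding.

```r
r2 <- 0.924   # Multiple R-squared from summary(model3)
n  <- 158     # number of observations, from str(cruise)
p  <- 4       # number of predictors in model3

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # about 0.922
```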

Check Model Assumptions

plot(model3)

plot(cruise$tonnage, model3$residuals )
plot(cruise$passengers, model3$residuals )
plot(cruise$length, model3$residuals )
plot(cruise$cabins, model3$residuals )

plot(model3$residuals ~ c(1:length(model3$residuals)), ylab = "Residuals", xlab ="Order of Collection")

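Normal probability plot: The residuals in the Q-Q plot appear to be nearly normal, although the maximum residual of 7.02 in the model summary is large relative to the residual standard error of 0.98 and suggests at least one outlying ship.

The variability of the residuals is nearly constant: There is no apparent pattern in the residual plots, and the spread of the residuals is nearly constant.

Residuals are independent: There is no pattern that would indicate a problem. Plotted in the order they were collected, the residuals appear independent.

Each variable is linearly related to the outcome: The plots of residuals against each predictor show no apparent curvature, which is consistent with a linear relationship.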
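The summary(model3) output shows a maximum residual of 7.02 against a residual standard error of 0.98, so it may be worth checking for unusual ships in addition to reading the plots. One possible way is to flag observations whose standardized residual exceeds 3 in absolute value. The sketch below shows the idea on simulated data with one planted outlier; the data, the threshold of 3 and the variable names are assumptions for illustration only.

```r
# Simulated data with a known linear relationship.
set.seed(42)
n <- 100
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 1 + 0.8 * sim$x + rnorm(n)
sim$y[17] <- sim$y[17] + 10   # plant one unusually large positive residual

fit <- lm(y ~ x, data = sim)

# rstandard() scales each residual by its estimated standard deviation.
flagged <- which(abs(rstandard(fit)) > 3)
print(flagged)
```

On the cruise data, the same check would be which(abs(rstandard(model3)) > 3), and cruise[flagged, ] would show which ships are involved.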

Part 4 - Inference:

Inference for the model as a whole.

\(H_0\): \(\beta_1 = \beta_2 = \dots = \beta_p = 0\). Every slope is zero when the other variables are included in the model.

\(H_A\): At least one \(\beta_i \neq 0\). At least one slope is not zero when the other variables are included in the model.

Since the p-value of the F-test is below 0.05 (it is reported as < 2.2e-16 in the model summaries above), we reject the null hypothesis and conclude that at least one \(\beta_i\) is non-zero.
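For reference, the overall F-test can be reproduced approximately from figures already printed in the summary(model3) output (multiple R-squared 0.924, F-statistic 465.3 on 4 and 153 degrees of freedom). The sketch below assumes the final model is the one being tested; because the R-squared is rounded, the recomputed F-statistic is close to, but not exactly, 465.3.

```r
r2 <- 0.924   # Multiple R-squared from summary(model3)
n  <- 158     # number of observations
p  <- 4       # number of predictors in model3

# F-statistic for H0: all slopes are zero, written in terms of R-squared.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
print(round(f_stat, 1))

# Upper-tail probability of the reported F-statistic under H0.
p_value <- pf(465.3, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(p_value < 0.05)
```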

Inference of slope

Given all variables, which ones are significant predictors of the number of crew?

summary(model1)
## 
## Call:
## lm(formula = crew ~ age + tonnage + passengers + length + cabins + 
##     passden, data = cruise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7700 -0.4881 -0.0938  0.4454  7.0077 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5213400  1.0570350  -0.493  0.62258    
## age         -0.0125449  0.0141975  -0.884  0.37832    
## tonnage      0.0132410  0.0118928   1.113  0.26732    
## passengers  -0.1497640  0.0475886  -3.147  0.00199 ** 
## length       0.4034785  0.1144548   3.525  0.00056 ***
## cabins       0.8016337  0.0892227   8.985 9.84e-16 ***
## passden     -0.0006577  0.0158098  -0.042  0.96687    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9819 on 151 degrees of freedom
## Multiple R-squared:  0.9245, Adjusted R-squared:  0.9215 
## F-statistic:   308 on 6 and 151 DF,  p-value: < 2.2e-16

Given all other variables in the model, passengers, length and cabins each have a p-value below 0.05, so they are significant predictors.
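Each t value in the table is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom. The sketch below recomputes these from the numbers printed in the summary(model1) output; the variable names are chosen here for illustration.

```r
# Estimates and standard errors copied from the summary(model1) output above.
est <- c(age = -0.0125449, tonnage = 0.0132410, passengers = -0.1497640,
         length = 0.4034785, cabins = 0.8016337, passden = -0.0006577)
se  <- c(age = 0.0141975, tonnage = 0.0118928, passengers = 0.0475886,
         length = 0.1144548, cabins = 0.0892227, passden = 0.0158098)
df_resid <- 151   # residual degrees of freedom reported for model1

t_value <- est / se
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# Predictors whose slope differs significantly from zero at the 5% level.
significant <- names(p_value)[p_value < 0.05]
print(round(t_value, 3))
print(significant)
```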

Part 5 - Conclusion:

Based on the multiple regression model, a number of variables are related to the size of the crew on board a ship: tonnage, passengers, length and cabins all remain in the final model. Even without a model, one would expect a bigger ship with many cabins to need a larger crew. In a future study, I would like to include a variable measuring quality of service, to find out whether it is related to crew size.

References:

Data Source: [http://www.truecruse.com]

Multiple Regression Notes: [https://www2.stat.duke.edu/courses/Summer13/sta104.01-1/slides/unit7lec2H.pdf]

Linear Regression Notes: [https://htmlpreview.github.io/?https://github.com/jbryer/IS606Fall2015/blob/master/Pages/Linear_Regression_SAT.html]