Introduction
Cruise ships are becoming a major part of tourism. Having amenities such as food and entertainment all in one place is convenient for older travelers and appealing to families with young kids, such as mine. This is why I was immediately drawn to the cruise ship data when searching online for data sets to analyze.
Research Question: Based on the available data, which variables make a good predictive model for the number of crew members needed on a cruise ship?
Data Collection
The data set was obtained in text format from the University of Florida Statistics website. According to that site, the data was collected from www.truecruse.com.
Cases
There are 158 observations. Each observation is a cruise ship, and the columns are variables describing its characteristics.
Variables
variable | description |
---|---|
ship | name of ship |
line | cruise line |
age | age of ship in years (as of 2013) |
tonnage | tonnage of ship (1000s of tons) |
passengers | passengers on board (100s) |
length | length of ship (100s of feet) |
cabins | number of cabins (100s) |
passden | passenger density |
crew | number of crew members (100s) |
We will look at the explanatory variables age, tonnage, passengers, length, cabins and passden and how they relate to the number of crew on board.
Type of Study
This is an observational study because the data was recorded as it occurred. No intervention was applied to the cases or variables to influence the results.
Scope of Inference - generalizability
The populations of interest are cruise goers, cruise ships and their crews. I believe there is enough data to generalize the findings to most cruise ships and their crew sizes.
Scope of Inference - causality
Since this is an observational study, we cannot establish cause-and-effect relationships between the variables of interest.
Load Data
# Load packages we will use later
library(knitr)
library(corrplot)
Let us load the Cruise Ship data
cruise <- read.csv("cruise_ships.csv")
kable(head(cruise))
ship | line | age | tonnage | passengers | length | cabins | passden | crew |
---|---|---|---|---|---|---|---|---|
Journey | Azamara | 6 | 30.277 | 6.94 | 5.94 | 3.55 | 42.64 | 3.55 |
Quest | Azamara | 6 | 30.277 | 6.94 | 5.94 | 3.55 | 42.64 | 3.55 |
Celebration | Carnival | 26 | 47.262 | 14.86 | 7.22 | 7.43 | 31.80 | 6.70 |
Conquest | Carnival | 11 | 110.000 | 29.74 | 9.53 | 14.88 | 36.99 | 19.10 |
Destiny | Carnival | 17 | 101.353 | 26.42 | 8.92 | 13.21 | 38.36 | 10.00 |
Ecstasy | Carnival | 22 | 70.367 | 20.52 | 8.55 | 10.20 | 34.29 | 9.20 |
Summarize Data
# examine column data types
str(cruise)
## 'data.frame': 158 obs. of 9 variables:
## $ ship : Factor w/ 138 levels "Adventure","Allegra",..: 53 92 12 16 21 24 25 34 35 37 ...
## $ line : Factor w/ 20 levels "Azamara","Carnival",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ age : int 6 6 26 11 17 22 15 23 19 6 ...
## $ tonnage : num 30.3 30.3 47.3 110 101.4 ...
## $ passengers: num 6.94 6.94 14.86 29.74 26.42 ...
## $ length : num 5.94 5.94 7.22 9.53 8.92 8.55 8.55 8.55 8.55 9.51 ...
## $ cabins : num 3.55 3.55 7.43 14.88 13.21 ...
## $ passden : num 42.6 42.6 31.8 37 38.4 ...
## $ crew : num 3.55 3.55 6.7 19.1 10 9.2 9.2 9.2 9.2 11.5 ...
# summarise the data
summary(cruise)
## ship line age tonnage
## Spirit : 4 Royal Caribbean :23 Min. : 4.00 Min. : 2.329
## Legend : 3 Carnival :22 1st Qu.:10.00 1st Qu.: 46.013
## Star : 3 Princess :17 Median :14.00 Median : 71.899
## Crown : 2 Holland American:14 Mean :15.69 Mean : 71.285
## Dawn : 2 Norwegian :13 3rd Qu.:20.00 3rd Qu.: 90.772
## Freedom: 2 Costa :11 Max. :48.00 Max. :220.000
## (Other):142 (Other) :58
## passengers length cabins passden
## Min. : 0.66 Min. : 2.790 Min. : 0.330 Min. :17.70
## 1st Qu.:12.54 1st Qu.: 7.100 1st Qu.: 6.133 1st Qu.:34.57
## Median :19.50 Median : 8.555 Median : 9.570 Median :39.09
## Mean :18.46 Mean : 8.131 Mean : 8.830 Mean :39.90
## 3rd Qu.:24.84 3rd Qu.: 9.510 3rd Qu.:10.885 3rd Qu.:44.19
## Max. :54.00 Max. :11.820 Max. :27.000 Max. :71.43
##
## crew
## Min. : 0.590
## 1st Qu.: 5.480
## Median : 8.150
## Mean : 7.794
## 3rd Qu.: 9.990
## Max. :21.000
##
As noted above, we will use age, tonnage, passengers, length, cabins and passden as predictor variables in a multiple linear regression. The response variable is crew.
Matrix Scatterplot
Create a scatterplot matrix of the seven numeric variables (the six predictors plus crew).
plot(cruise[3:9],col='blue', main="Scatterplot Matrix")
From the scatterplot matrix we can see a strong positive linear relationship between crew and each of cabins, length, passengers and tonnage. The age variable has a moderately weak relationship with crew. Lastly, passenger density shows a weak positive relationship with the number of crew members on board.
Correlation plot
modeldata = cor(cruise[c(4:7,9)])
corrplot(modeldata, method = "number")
The correlation between the number of cabins and the number of crew members is the highest, at 0.95. This suggests that the number of cabins is strongly and positively related to crew size.
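As a rough sketch of how the correlations with crew could be ranked in code rather than read off the plot, the snippet below uses a hypothetical helper, `rank_by_correlation` (not part of the original analysis). To stay self-contained it runs on only the six rows printed by `head(cruise)`, so its values will not match the 0.95 figure, which comes from all 158 ships.

```r
# Six rows copied from the head(cruise) output; the full data set has 158 ships.
head_rows <- data.frame(
  tonnage    = c(30.277, 30.277, 47.262, 110.000, 101.353, 70.367),
  passengers = c(6.94, 6.94, 14.86, 29.74, 26.42, 20.52),
  length     = c(5.94, 5.94, 7.22, 9.53, 8.92, 8.55),
  cabins     = c(3.55, 3.55, 7.43, 14.88, 13.21, 10.20),
  crew       = c(3.55, 3.55, 6.70, 19.10, 10.00, 9.20)
)

# Hypothetical helper: correlation of every other column with the response,
# ordered from strongest to weakest by absolute value.
rank_by_correlation <- function(df, response) {
  cors <- cor(df)[, response]
  cors <- cors[names(cors) != response]
  cors[order(abs(cors), decreasing = TRUE)]
}

print(round(rank_by_correlation(head_rows, "crew"), 2))
```

With the full data loaded, the same call would be `rank_by_correlation(cruise[c(4:7, 9)], "crew")`.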
Fit LRM
Let us start by fitting a linear regression to this data set using backward selection: begin with all predictors and remove the least significant one at each step.
model1 = lm(crew ~ age + tonnage + passengers + length + cabins + passden, data=cruise)
summary(model1)
##
## Call:
## lm(formula = crew ~ age + tonnage + passengers + length + cabins +
## passden, data = cruise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7700 -0.4881 -0.0938 0.4454 7.0077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5213400 1.0570350 -0.493 0.62258
## age -0.0125449 0.0141975 -0.884 0.37832
## tonnage 0.0132410 0.0118928 1.113 0.26732
## passengers -0.1497640 0.0475886 -3.147 0.00199 **
## length 0.4034785 0.1144548 3.525 0.00056 ***
## cabins 0.8016337 0.0892227 8.985 9.84e-16 ***
## passden -0.0006577 0.0158098 -0.042 0.96687
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9819 on 151 degrees of freedom
## Multiple R-squared: 0.9245, Adjusted R-squared: 0.9215
## F-statistic: 308 on 6 and 151 DF, p-value: < 2.2e-16
Several coefficients are not statistically different from zero: age, tonnage and passden. We will drop passden first since it has the largest p-value.
Let us re-fit the model
model2 = lm(crew ~ age + tonnage + passengers + length + cabins, data=cruise)
summary(model2)
##
## Call:
## lm(formula = crew ~ age + tonnage + passengers + length + cabins,
## data = cruise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7702 -0.4885 -0.0896 0.4447 7.0066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.555922 0.650835 -0.854 0.394357
## age -0.012354 0.013395 -0.922 0.357834
## tonnage 0.012915 0.008922 1.448 0.149784
## passengers -0.148627 0.038837 -3.827 0.000189 ***
## length 0.403835 0.113758 3.550 0.000513 ***
## cabins 0.802165 0.088013 9.114 4.37e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9786 on 152 degrees of freedom
## Multiple R-squared: 0.9245, Adjusted R-squared: 0.922
## F-statistic: 372 on 5 and 152 DF, p-value: < 2.2e-16
Now age has the largest p-value. We will remove it and re-fit the model.
model3 = lm(crew ~ tonnage + passengers + length + cabins, data=cruise)
summary(model3)
##
## Call:
## lm(formula = crew ~ tonnage + passengers + length + cabins, data = cruise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7593 -0.4639 -0.0716 0.4698 7.0239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.818710 0.584866 -1.400 0.163590
## tonnage 0.016319 0.008119 2.010 0.046185 *
## passengers -0.149851 0.038795 -3.863 0.000165 ***
## length 0.397554 0.113499 3.503 0.000604 ***
## cabins 0.790837 0.087109 9.079 5.18e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9782 on 153 degrees of freedom
## Multiple R-squared: 0.924, Adjusted R-squared: 0.9221
## F-statistic: 465.3 on 4 and 153 DF, p-value: < 2.2e-16
In the final model, there is evidence that every remaining coefficient differs from zero: all four predictors have p-values below 0.05, although tonnage (p = 0.046) is only marginally significant. No remaining variable can be eliminated from the model.
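The manual drop-and-refit steps above can be sketched as a loop. This is only an illustration under stated assumptions: `backward_eliminate` is a hypothetical helper, it assumes numeric predictors (so coefficient names match column names), and it uses a 0.05 cutoff on the p-value, which the analysis applies implicitly.

```r
# Hypothetical helper: p-value-based backward elimination.
# Assumes every predictor is numeric and that `predictors` is non-empty.
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- coef(summary(fit))
    # The intercept is never a candidate for removal.
    terms_left <- setdiff(rownames(coefs), "(Intercept)")
    pvals <- coefs[terms_left, "Pr(>|t|)"]
    # Stop once every remaining predictor is significant at alpha.
    if (max(pvals) < alpha) return(fit)
    worst <- terms_left[which.max(pvals)]
    message(sprintf("dropping %s (p = %.4f)", worst, max(pvals)))
    predictors <- setdiff(predictors, worst)
    # If nothing survives, fall back to the intercept-only model.
    if (length(predictors) == 0) {
      return(lm(reformulate("1", response = response), data = data))
    }
  }
}
```

Called as `backward_eliminate(cruise, "crew", c("age", "tonnage", "passengers", "length", "cabins", "passden"))`, it should retrace the steps shown in the summaries: drop passden, then age, then stop.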
Our multiple regression model is
\(\widehat{\text{crew}} = -0.818710 + 0.016319 \times \text{tonnage} - 0.149851 \times \text{passengers} + 0.397554 \times \text{length} + 0.790837 \times \text{cabins}\)
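To make the fitted equation concrete, here is a small sketch that plugs two ships from the `head(cruise)` table into it. The function name `predict_crew` is hypothetical, and the coefficients are copied by hand from the `summary(model3)` output rather than taken from a fitted object.

```r
# Coefficients copied from the summary(model3) output.
coefs <- c(intercept = -0.818710, tonnage = 0.016319, passengers = -0.149851,
           length = 0.397554, cabins = 0.790837)

# Hypothetical helper: evaluate the fitted equation for one ship.
# Inputs and result are in the data set's units (crew in 100s).
predict_crew <- function(tonnage, passengers, length, cabins) {
  unname(coefs["intercept"] +
           coefs["tonnage"] * tonnage +
           coefs["passengers"] * passengers +
           coefs["length"] * length +
           coefs["cabins"] * cabins)
}

journey  <- predict_crew(30.277, 6.94, 5.94, 3.55)     # observed crew: 3.55
conquest <- predict_crew(110.000, 29.74, 9.53, 14.88)  # observed crew: 19.10

# Prints roughly 3.80 for Journey and 12.08 for Conquest.
print(round(c(journey = journey, conquest = conquest), 2))
```

Journey is predicted closely, while Conquest is under-predicted by about 7.02 (19.10 observed against 12.08 predicted), which appears to be the maximum residual reported in the model summary.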
The multiple R-squared is 0.924 and the adjusted R-squared is 0.9221, which means the model explains about 92% of the variability in the number of crew members across ships.
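As a quick check on how the two R-squared figures in the summary relate, the adjusted value can be reproduced by hand. The sketch below takes \(n = 158\) ships and \(p = 4\) predictors from the model output (153 residual degrees of freedom) and uses the rounded multiple R-squared of 0.924, so the last digit may differ slightly from the 0.9221 that R reports.

```latex
\[
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.924)\,\frac{157}{153}
          \approx 1 - 0.0780
          = 0.922
\]
```

The adjustment is small here because the number of predictors is small relative to the number of ships.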
Check Model Assumptions
plot(model3)
plot(cruise$tonnage, model3$residuals )
plot(cruise$passengers, model3$residuals )
plot(cruise$length, model3$residuals )
plot(cruise$cabins, model3$residuals )
plot(model3$residuals ~ c(1:length(model3$residuals)), ylab = "Residuals", xlab ="Order of Collection")
Normal probability plot: The residuals in the Q-Q plot appear nearly normal, although the largest residual (about 7.0, against a residual standard error of 0.98) points to at least one outlier in the upper tail.
The variability of the residuals is nearly constant: There is no apparent pattern in the residuals-versus-fitted plot.
Residuals are independent: Plotted in the order of collection, the residuals show no pattern that would indicate a problem.
Each variable is linearly related to the outcome: The residuals plotted against each predictor show no clear curvature, which is consistent with a linear relationship.
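The visual checks above could be complemented by a numeric screen for unusually large residuals. The sketch below is one possible approach, not part of the original analysis: `flag_large_residuals` is a hypothetical helper, and the cutoff of 3 standardized units is an assumed rule of thumb.

```r
# Hypothetical helper: list observations whose standardized residual
# exceeds `cutoff` in absolute value (the default of 3 is an assumption).
flag_large_residuals <- function(model, cutoff = 3) {
  std_res <- rstandard(model)            # standardized residuals from stats
  flagged <- which(abs(std_res) > cutoff)
  data.frame(row = as.integer(flagged),
             std_residual = unname(std_res[flagged]))
}
```

Running `flag_large_residuals(model3)` would list any ships that deserve a closer look before trusting the normality and constant-variance checks.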
Inference for the model as a whole.
\(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\). No predictor has a non-zero slope when the other variables are included in the model.
\(H_A\): At least one \(\beta_i \neq 0\). At least one predictor has a non-zero slope when the other variables are included in the model.
For each fitted model the F-test p-value is below 2.2e-16, well under 0.05, so we reject the null hypothesis and conclude that at least one \(\beta_i\) is non-zero.
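The overall F-test can be reproduced from numbers already printed in the `summary(model3)` output. The sketch below is a hand calculation under the assumption that \(n = 158\) and \(p = 4\); because it starts from the rounded R-squared of 0.924, the statistic lands near, not exactly on, the reported 465.3.

```r
# Values read from the summary(model3) output.
r2 <- 0.924   # multiple R-squared (rounded)
n  <- 158     # number of ships
p  <- 4       # number of predictors in model3

# F statistic for H0: all slopes are zero.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail p-value for the F statistic that R reports.
p_value <- pf(465.3, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(f_stat)               # close to the reported 465.3
print(p_value < 2.2e-16)    # TRUE, matching "p-value: < 2.2e-16"
```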
Inference of slope
Given all variables, which ones are significant predictors of the number of crew members?
summary(model1)
##
## Call:
## lm(formula = crew ~ age + tonnage + passengers + length + cabins +
## passden, data = cruise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7700 -0.4881 -0.0938 0.4454 7.0077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5213400 1.0570350 -0.493 0.62258
## age -0.0125449 0.0141975 -0.884 0.37832
## tonnage 0.0132410 0.0118928 1.113 0.26732
## passengers -0.1497640 0.0475886 -3.147 0.00199 **
## length 0.4034785 0.1144548 3.525 0.00056 ***
## cabins 0.8016337 0.0892227 8.985 9.84e-16 ***
## passden -0.0006577 0.0158098 -0.042 0.96687
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9819 on 151 degrees of freedom
## Multiple R-squared: 0.9245, Adjusted R-squared: 0.9215
## F-statistic: 308 on 6 and 151 DF, p-value: < 2.2e-16
Given all other variables in the full model, passengers, length and cabins have p-values below 0.05, so they are significant predictors.
Based on the multiple regression model, several variables are associated with the number of crew members on board. Even without a model, it is intuitive that a bigger ship with many cabins needs a larger crew. In a future study, I would like to include a variable measuring quality of service, to see whether it is related to crew size.
Data Source: [http://www.truecruse.com]
Multiple Regression Notes: [https://www2.stat.duke.edu/courses/Summer13/sta104.01-1/slides/unit7lec2H.pdf]
Linear Regression Notes: [https://htmlpreview.github.io/?https://github.com/jbryer/IS606Fall2015/blob/master/Pages/Linear_Regression_SAT.html]