Introduction

The data come from Kaggle’s well-known Titanic dataset. I will fit a linear regression of ticket fare on age to see whether a passenger’s age had any relationship with the fare they paid.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
titanic_ds <- read.csv("train.csv", header = TRUE)
glimpse(titanic_ds)
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name        <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         <fct> male, female, female, female, male, male, male, ma...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket      <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin       <fct> , C85, , C123, , , E46, , , , G6, C103, , , , , , ...
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q,...

Regression Model

model <- lm(Fare ~ Age, titanic_ds)
summary(model)
## 
## Call:
## lm(formula = Fare ~ Age, data = titanic_ds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.42 -24.49 -17.60   2.33 475.78 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.3009     4.4922   5.410 8.64e-08 ***
## Age           0.3500     0.1359   2.575   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.71 on 712 degrees of freedom
##   (177 observations deleted due to missingness)
## Multiple R-squared:  0.009229,   Adjusted R-squared:  0.007837 
## F-statistic: 6.632 on 1 and 712 DF,  p-value: 0.01022

Age explains only about 0.92% of the variance in fare (Multiple R-squared = 0.009229), so it has very little explanatory power for the fare paid. The relationship is still positive, and the slope is statistically significant at the 5% level (p = 0.0102): for each one-year increase in age, the fare increases by about 0.35 units on average.
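As a quick check on how weak the relationship is, the R-squared from the summary can be converted to a percentage and to a correlation coefficient. This is a small sketch that relies on the fact that, in a regression with a single predictor, R-squared equals the squared Pearson correlation. The variable names are only illustrative, and the values are copied by hand from the output above.

```r
# Values copied from summary(model) above
r_squared <- 0.009229   # Multiple R-squared
slope     <- 0.3500     # Estimate for Age

# Share of the variance in Fare explained by Age, in percent
pct_explained <- r_squared * 100

# With one predictor, R-squared is the squared correlation between Age and Fare;
# the correlation takes the sign of the slope
r <- sign(slope) * sqrt(r_squared)

pct_explained   # 0.9229
round(r, 3)     # about 0.096
```

A correlation of roughly 0.1 is consistent with a weak positive association between age and fare.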

The fitted model is:

Fare = 24.3009 + 0.35(Age)
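To illustrate how this equation is read, the sketch below plugs an age into the fitted line. The helper `predict_fare` is a hypothetical name introduced here for illustration, and the coefficients are typed in from the summary output rather than taken from the model object.

```r
# Coefficients copied from summary(model) above
intercept <- 24.3009
slope_age <- 0.3500

# Hypothetical helper: predicted fare for a given age under the fitted line
predict_fare <- function(age) {
  intercept + slope_age * age
}

predict_fare(30)                      # about 34.80
predict_fare(40) - predict_fare(30)   # ten years of age add about 3.5 to the fare
```

With the fitted object available, `predict(model, newdata = data.frame(Age = 30))` should give the same value up to the rounding of the printed coefficients.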

Residual Analysis

par(mfrow=c(2,2))
plot(model)

hist(titanic_ds$Age)
hist(titanic_ds$Fare)

The residuals are not normally distributed, and a pattern can also be seen in the diagnostic plots. The model therefore does not satisfy the assumptions of linear regression and should not be used for further analysis until these issues are fixed. The skewed histogram of ‘Fare’ above points to the same problem, as do the residual quantiles in the summary (median -17.60, maximum 475.78).
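One possible way to follow up, which is not part of the analysis above, is to test the residuals for normality and to refit the model on a log-transformed fare. The sketch below uses simulated stand-in data (not the Titanic data) so that it runs on its own. The data-generating values, the choice of a Shapiro-Wilk test, and the `log1p` transform are all assumptions made for illustration.

```r
# Simulated stand-in data, NOT the Titanic data:
# right-skewed fares with a weak positive age effect
set.seed(1)
n <- 500
sim <- data.frame(Age = runif(n, min = 1, max = 80))
sim$Fare <- exp(2.5 + 0.005 * sim$Age + rnorm(n, sd = 1))

# Same model form as above, on the raw and on the log scale.
# log1p(x) = log(1 + x) stays finite if a fare is zero.
fit_raw <- lm(Fare ~ Age, data = sim)
fit_log <- lm(log1p(Fare) ~ Age, data = sim)

# Shapiro-Wilk test on the residuals: W close to 1 suggests normality,
# a small p-value indicates a departure from it
sw_raw <- shapiro.test(residuals(fit_raw))
sw_log <- shapiro.test(residuals(fit_log))

c(raw = unname(sw_raw$statistic), log = unname(sw_log$statistic))
```

On the real data, the same two `lm()` calls could be run with `titanic_ds` in place of `sim`, followed by `plot()` on the refitted model to see whether the diagnostics improve. Whether a log transform is enough for this dataset would still need to be checked.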