- Initial setup and configuration of the data set.
- Load the data set file into the variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking
demand dataset by Antonio, Almeida and Nunes.
A list of at least 3 columns (or values) in your data which are
unclear until you read the documentation.
A few columns, listed below, were unclear until I read the
documentation.
1. meal: The meal column contains codes or abbreviations (for example
BB and HB) for the different meal plans offered by the hotel.
2. market_segment: Without consulting the documentation, it is unclear
which segments or categories this column refers to (for example, what
TA and TO stand for).
3. deposit_type: This column contains encoded values indicating the
type of deposit required for the booking.
(First six and last six sample rows from the Hotel dataset)
##          hotel meal market_segment deposit_type agent company
## 1 Resort Hotel   BB         Direct   No Deposit  NULL    NULL
## 2 Resort Hotel   BB         Direct   No Deposit  NULL    NULL
## 3 Resort Hotel   BB         Direct   No Deposit  NULL    NULL
## 4 Resort Hotel   BB      Corporate   No Deposit   304    NULL
## 5 Resort Hotel   BB      Online TA   No Deposit   240    NULL
## 6 Resort Hotel   BB      Online TA   No Deposit   240    NULL
##             hotel meal market_segment deposit_type agent company
## 119385 City Hotel   BB  Offline TA/TO   No Deposit   394    NULL
## 119386 City Hotel   BB  Offline TA/TO   No Deposit   394    NULL
## 119387 City Hotel   BB      Online TA   No Deposit     9    NULL
## 119388 City Hotel   BB      Online TA   No Deposit     9    NULL
## 119389 City Hotel   BB      Online TA   No Deposit    89    NULL
## 119390 City Hotel   HB      Online TA   No Deposit     9    NULL
Why do you think they chose to encode the data the way they did?
What could have happened if you didn’t read the documentation?
The data may have been encoded to efficiently store categorical
information and reduce storage space. Encoding also helps maintain
consistency and prevent errors in data entry. Without reading the
documentation, one might misinterpret the encoded values, leading to
incorrect analysis and conclusions.
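To illustrate how encoded values depend on the documentation, here is a minimal Python sketch of a lookup table for the meal codes. The code descriptions are my reading of the dataset's published documentation and should be verified against it. The function name `decode_meal` is hypothetical.

```python
# Lookup table for the meal codes. The descriptions are assumptions based on
# the dataset's published documentation and should be checked against it.
MEAL_LABELS = {
    "BB": "Bed & Breakfast",
    "HB": "Half board (breakfast and one other meal)",
    "FB": "Full board (breakfast, lunch and dinner)",
    "SC": "No meal package",
    "Undefined": "No meal package",
}


def decode_meal(code):
    """Return a readable label, or flag codes that are not in the lookup."""
    return MEAL_LABELS.get(code.strip(), f"Unknown code: {code!r}")


# Codes that are not documented are flagged instead of silently guessed.
for code in ["BB", "HB", "XX"]:
    print(code, "->", decode_meal(code))
```

Flagging unknown codes explicitly, instead of guessing, keeps a misread abbreviation from silently entering the analysis.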
At least one element of your data that is unclear even after reading the
documentation
You may need to do some digging, but is there anything about the data
that your documentation does not explain?
1. agent: The meaning and significance of the “agent” column are
unclear. The documentation describes it as the ID of the travel agent,
but it does not give enough detail about the role of agents or how
their identification numbers are assigned. In addition, the column is
marked “NULL” for many entries. It is unclear whether this means the
booking was made without a travel agency or whether there is another
reason for the null values. Understanding what these null values mean
could provide insight into the booking process and the role of travel
agencies in the dataset. Further exploration or additional
documentation may be necessary to clarify this aspect of the data.
2. company: The company column is unclear because it is not explicitly
defined within the dataset or in the provided column descriptions. It
is not clear whether it refers to the company making the booking, the
company the guest is associated with, or some other entity. Further
clarification or additional documentation may be needed to understand
the significance of this column.
##   Agent_Present  Count
## 1          TRUE 119390
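The count above reports all 119,390 rows as Agent_Present = TRUE, even though the sample rows show NULL in the agent column. One plausible explanation is that the missing values are stored as the literal text "NULL", which a standard missing-value check does not catch. This is an assumption, since the code behind the table is not shown. The Python sketch below contrasts the two checks on a small hypothetical sample. The helper `is_missing` is illustrative only.

```python
from collections import Counter

# Hypothetical sample mirroring the agent values in the first six rows above.
agents = ["NULL", "NULL", "NULL", "304", "240", "240"]


def is_missing(value):
    """Treat None, empty strings and the literal text 'NULL' as missing."""
    return value is None or value.strip().upper() in {"", "NULL"}


# Naive check: only a true None counts as missing, so every row looks present.
naive = Counter(value is not None for value in agents)

# Corrected check: the literal string "NULL" is counted as missing as well.
corrected = Counter(not is_missing(value) for value in agents)

print("naive agent-present counts:    ", dict(naive))
print("corrected agent-present counts:", dict(corrected))
```

If the same effect applies to the full dataset, recoding "NULL" as a true missing value at load time would make the count table reflect the actual share of bookings without an agent.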
Build a visualization which uses a column of data that is affected
by the issue you brought up above. In this visualization, find a way
to highlight the issue, and explain what is unclear and why it might be
unclear.
To visualize the issue with the “agent” column, where
many entries are marked as “NULL”, I have created a bar chart showing
the distribution of bookings based on whether they have a non-null value
in the “agent” column. This helps highlight the proportion of
bookings where the agent information is missing.
Visualization - Presence of Agent Information in Bookings
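As a rough stand-in for a bar chart of this kind, the following Python sketch renders category counts as text bars scaled to each category's share. It is not the code used to produce the original figure. The function `text_bar_chart` and the example counts are hypothetical.

```python
def text_bar_chart(counts, width=40):
    """Return one line per category with a bar scaled to its share of rows."""
    total = sum(counts.values())
    lines = []
    for label, count in counts.items():
        share = count / total if total else 0.0
        bar = "#" * round(share * width)
        lines.append(f"{label:<14} {bar} {count} ({share:.1%})")
    return lines


# Hypothetical counts for illustration only; not taken from the full dataset.
example_counts = {"Agent present": 3, "Agent missing": 3}
for line in text_bar_chart(example_counts):
    print(line)
```

Showing the share next to the raw count makes it easier to see how much of the data would be affected by any decision about the missing agent values.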

Do you notice any significant risks? If so, what could you do to
reduce negative consequences?
Based on my understanding of the dataset, there are significant risks
associated with NULL values in the “agent” column:
1. Potential Bias in Analysis: If the “NULL” values represent data that
is missing for a systematic reason, e.g. certain types of bookings not
being associated with an agent, the analysis may be biased. For example,
if bookings made through certain channels are less likely to have an
agent associated with them, any analysis involving the “agent” column
may be skewed.
2. Loss of Information: NULL values in the “agent” column
indicate missing information about the booking process. This loss of
information could hinder the ability to accurately understand and model
factors affecting bookings, cancellations, or other outcomes.
To reduce the negative consequences of NULL values in the “agent”
column, the following steps can be taken:
2.1 Investigate the Reason for NULL Values:
Insight: It is crucial to understand why “NULL”
values are present in the “agent” column. This might involve consulting
with data providers, examining data collection processes, or conducting
data audits to identify any systematic reasons for missing values.
Significance: Understanding the reasons for NULL values helps to
contextualize their presence in the dataset. It allows us to assess
whether the missing data is random or systematic and whether it
introduces bias into the analysis.
Further Questions:
- Are NULL values randomly distributed throughout the dataset, or do they
occur more frequently in specific subsets of data?
- Are there any patterns or trends associated with bookings that have NULL
values in the “agent” column?
- Are there any differences in booking characteristics or outcomes between
bookings with and without agent information?
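One way to check whether the NULL values are concentrated in particular subsets is to compute the share of missing agents per market segment. The Python sketch below does this on a small hypothetical sample that mirrors the first six rows shown earlier. It assumes the missing values are stored as the literal text "NULL". The function name `missing_rate_by_group` is illustrative.

```python
from collections import defaultdict

# Hypothetical rows mirroring the sample shown earlier: (market_segment, agent).
rows = [
    ("Direct", "NULL"), ("Direct", "NULL"), ("Direct", "NULL"),
    ("Corporate", "304"), ("Online TA", "240"), ("Online TA", "240"),
]


def missing_rate_by_group(rows):
    """Return {group: share of rows whose agent is the literal 'NULL'}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for group, agent in rows:
        totals[group] += 1
        if agent.strip().upper() == "NULL":
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


for segment, rate in missing_rate_by_group(rows).items():
    print(f"{segment:<10} {rate:.0%} missing")
```

A rate that differs sharply between segments would suggest the values are missing for a systematic reason, not at random. Six rows are far too few to conclude that here.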
2.2 Impute Missing Values:
Insight: Depending on the reason for the missing values, it may be
possible to impute them using appropriate methods such as mode
imputation or predictive imputation. Because “agent” is an identifier
and not a numeric quantity, mean imputation is not meaningful for this
column. Recoding the missing values as an explicit “no agent” category
is another option. Imputation should be done cautiously to avoid
introducing bias or inaccuracies into the dataset.
Significance: Careful imputation can help maintain the integrity of the
dataset and reduce the bias that missing data introduces into analyses
involving the “agent” column. It allows us to use all available
information to draw meaningful insights.
Further Questions:
- What is the distribution of imputed values compared to observed values
in the “agent” column?
- How sensitive are our analyses to the choice of imputation method?
- Are there any outliers or anomalies in the imputed values that need
further investigation?
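To make the comparison between imputed and observed values concrete, the following Python sketch applies two simple strategies to a hypothetical sample of agent IDs: mode imputation, and recoding missing values as their own category. It assumes missing values are stored as the literal text "NULL". The helper names are illustrative.

```python
from collections import Counter

MISSING = "NULL"  # assumption: missing agents are stored as this literal text
agents = ["NULL", "NULL", "NULL", "304", "240", "240"]  # hypothetical sample


def impute_mode(values):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v != MISSING]
    if not observed:
        return list(values)  # nothing to impute from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == MISSING else v for v in values]


def label_missing(values, label="No agent"):
    """Keep missingness visible by recoding it as its own category."""
    return [label if v == MISSING else v for v in values]


print("observed only:    ", Counter(v for v in agents if v != MISSING))
print("mode imputed:     ", Counter(impute_mode(agents)))
print("explicit category:", Counter(label_missing(agents)))
```

In this sample, mode imputation inflates the most common agent from two bookings to five, which shows how imputing an identifier can distort its distribution. The explicit category keeps the observed counts unchanged.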
2.3 Consider Multiple Perspectives:
Insight: When analyzing the dataset, it is essential to consider
multiple perspectives and to run sensitivity analyses that assess how
robust the findings are in the presence of missing data. This might
involve conducting separate analyses with and without the “agent”
column or exploring alternative ways to account for missing values in
the analysis.
Significance: By examining data from different angles and under
various assumptions, we gain a more comprehensive understanding of the
dataset’s limitations and uncertainties. It helps ensure the reliability
and validity of our conclusions.
Further Questions:
- How do the results of analyses change when excluding observations with
missing values in the “agent” column?
- Are there any interactions or confounding variables that may influence
the relationship between the “agent” column and other variables of
interest?
- Are there any alternative methods for handling missing data that may be
more appropriate for our specific analysis?
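A simple sensitivity check along these lines is to compute the same statistic with and without the rows that lack agent information. The Python sketch below compares cancellation rates on hypothetical data. The `is_canceled` outcome column is an assumption, since it does not appear in the sample shown in this report, and the values are invented for illustration.

```python
# Hypothetical rows: (agent, is_canceled). The is_canceled flag is an assumed
# outcome column; the values below are invented for illustration only.
rows = [
    ("NULL", 0), ("NULL", 1), ("NULL", 0),
    ("304", 0), ("240", 1), ("240", 1),
]


def cancel_rate(rows):
    """Share of bookings flagged as canceled; NaN when there are no rows."""
    if not rows:
        return float("nan")
    return sum(canceled for _, canceled in rows) / len(rows)


rate_all = cancel_rate(rows)
rate_with_agent = cancel_rate([r for r in rows if r[0] != "NULL"])
rate_without_agent = cancel_rate([r for r in rows if r[0] == "NULL"])

print(f"all bookings:        {rate_all:.2f}")
print(f"agent present only:  {rate_with_agent:.2f}")
print(f"agent missing only:  {rate_without_agent:.2f}")
```

If the statistic shifts noticeably when the rows with missing agents are dropped, conclusions drawn from the reduced data should be reported with that caveat.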
2.4 Document Limitations:
Insight: Transparent documentation of data limitations, including the
presence of NULL values in the “agent” column, ensures that readers
understand the scope and reliability of the analysis.
Significance: Documenting limitations helps to maintain the integrity
and reproducibility of the research findings. It provides transparency
about potential biases or uncertainties in the data and analysis
methods.
Further Questions:
- Are there any additional limitations or uncertainties in the dataset
that need to be documented?
- How can we effectively communicate the presence of missing data and its
potential impact on the interpretation of results to stakeholders or
readers?
- Are there any steps we can take to mitigate the impact of data
limitations on the validity of our conclusions?
By addressing these questions and conducting thorough
investigations, we can improve our understanding of the dataset,
reduce bias, and make our analyses and interpretations more
reliable.
Thank you!