File Organization

class: center, middle, inverse, title-slide

.title[
# File Organization
]
.subtitle[
## in Database Systems
]
.author[
### Awais Shah
]
.date[
### (2024-06-22)
]

---

# Table of Contents

- What is file organization
  - Definition.  
- Data Hierarchy
- CRUD
- Types of file organization  
  1. Physical File Organization
      - Sequential  
      - Heap  
      - Hashing
      - RAID 
  2. Logical File Organization

---
class: inverse, center, middle

# Introduction

---
# What is file organization

## File: 
- A file is a collection of related data stored in a computer's storage device.
- In databases, a file typically refers to a collection of records or data units.

## File Organization:
- File organization refers to the way data is stored in a file to optimize access and efficiency.
- It determines the structure and method used to store and retrieve records in a database.
- Effective file organization improves data retrieval speed, storage efficiency, and data integrity.
---

class: center, middle
### Data Hierarchy
![Hierarchy](https://i.imgur.com/T9Egut7.png)

---
# CRUD 
CRUD is an acronym from the world of computer programming and refers to the four functions considered necessary to implement a persistent storage application
## C - create
## R - Read
## U - Update
## D - Delete

---
class: center, middle, inverse
# File Organization Types
### 1. Physical File Organization
---

# Physical File Organization

Physical file organization deals with how data is physically stored on storage media.

### Methods of Physical file organization
  - Sequential  
  - Heap  
  - Hashing
  - Clustered 
  - RAID 
---

# Physical File Organization

### Sequential
- This method stores the records in files in sequential order, one after another in a series.
  - Like a sequence of books on a bookshelf. 
- To access these files, we have to search through the whole sequence until we reach our desired file in *O(n)*.
![](https://i.imgur.com/MGK5voJ.png)
R1,...Rn represents records in a file
---
# Physical File Organization
### Sequential Types
There are 2 types of sequential file organizations.

#### a. Pile file method.
- In this method, records are stored in sequential order, one after another, and they are inserted at the end of the file in the same order in which we insert them in the table using the SQL query.
- The records are unordered i.e. the files are randomly pushed after one another, which makes accessing costly in terms of time.
- In case of modification or deletion of any record, it is first searched throughout the file, and when it is found the updation or deletion operation is performed.
---
# Physical File Organization

- since we traverse the whole file to search for the target element the method works in O(n) time complexity.

![](https://i.imgur.com/Rx1qXDn.png)
In this file, the new R7 record will be inserted at the end of the file
---
# Physical File Organization
### b. Sorted file method
- In sorted file method, the file has to be kept in a sorted order all the time.
- The file is sorted after every delete,insert, and update operation on the basis of a primary key or another reference. 
- Insertion of the new record is done by adding the new record at the end of the file, after which the file is sorted in ascending or descending order based on the requirements.
- giving the insertion operation a time complexity of O(nlogn) used for sorting the file. 
---
### Sorted file method
![](https://i.imgur.com/BXE7Xec.png)
- Since the file is sorted, we can apply the binary search algorithm for updation and deletion. The complexity will be *O(nlogn)*
---
## Sequential file organization
### Pros:
- Simplest method to organize files.
- Very fast at accessing large amounts of data.
- A cost effective method as it can be implemented on magnetic tapes.

### Cons:
- Huge traversal cost because of O(n) complexity.
- In case of sorted file method, traversal cost is reduced to O(logn) but it is still costlier because it has to sort the file after every delete, update, and insert operation.
- The main requirement of this method is that the records must be of same size, which is difficult to implement in most real-world database systems.

---
# Physical File Organization

## Heap

- In Heap, records are inserted in a sequential manner, but unlike the sequential file method, the data blocks are not allocated sequentially DBMS can choose any data block for the record to be inserted. 
- There is no ordering of records in heap file organization once the data block is full, the next record is stored in the new data block, which might not be the next data block.

---
#Heap Organization
<img src="https://i.imgur.com/MP58oQ8.png" alt="Heap Organization" width="80%" style="display: block; margin: auto;">

---
# Heap Insertion
![](https://i.imgur.com/SuVRKOd.png)
---
# Physical File Organization

## Heap (Cont.)
- If we want to search, update or delete the data in heap file organization, then we need to traverse the data from staring of the file till we get the requested record.

- If the database is very large then searching, updating or deleting of record will be time-consuming because there is no sorting or ordering of records.
- We need to check all the data until we get the requested record.
---
## Heap file organization
### Pros:
- It is a very good method of file organization for bulk insertion. 
- In case of a small database, fetching and retrieving of records is faster than the sequential record.

### Cons:
- Since the method takes linear traversal for accessing the records,it takes time to search or modify the record.
- Memory block wastage.

---

# Physical File Organization

## Hash
- The method of arranging and storing records in a database using a hash function.
- Hashing is a better technique compared to sequential because we can access any record with constant time complexity of O(1) using its primary key.
- It is also called a mapping technique because it maps larger values smaller values
---
## Physical File Organization
### Hashing Method

#### **Search Key**
- A key through which we want to search a record. e.g. a primary key.

#### Hash Table
- A data structure where the hashed values are inserted

#### Hash Function
- A mathematical function which converts a key into a smaller value which is then stored in the hash table.

Common Hash Functions: K mod 10, k mod n, Mid square, Folding method

---
# Hashing Example
![](https://i.imgur.com/MQwrE3f.png)
---
## Hash file organization
### Pros:
- Efficient method in all CRUD commands because of its time complexity of O(1).
- Because of its speed and efficiency, it is used in large databases like ticket booking, online banking, e-commerce etc.
- Hash file organization in DBMS can handle multiple transactions at the same time because all records are independent, and multiple records can be accessed at the same time.
---
## Hash file organization

### Cons:
- Data loss because suppose the hashing key used is some non-prime attribute, say the name of the employee in an employee table, then for the same name, the same bucket address will be generated and hence one of them will override the other which will cause the data loss.

- Since there is no order in the arrangement of the memory addresses, it is costly to search some records in a particular range. 
  - For example, searching for students having marks 20 to 30 will not be an efficient operation because the memory addresses are scattered randomly.
---

# Physical File Organization

## Clustered
- A method to store related records together on disk.
- Two or more records/tables are combined into a single file based on the clustered key or hash clusters. 
- These files contain two or more tables in the same memory block, and all of them are combined using a single clustered key/hash key to a single table.
---
## Clustered Example
In this example, we have 2 files, The student file and the course file. If wee want to know which students are enrolled in which course, we need to join the tables based on the clustered index, which in this case is CourseID
![](https://i.imgur.com/jRhk06h.png)
---
## Clustered Example Cont.
<img src="https://i.imgur.com/aYvn9jN.png" alt="Heap Organization" width="65%" style="display: block; margin: auto;">
This operation will save time for us as we just need to run the query for the combined table and no longer need to use the join operation every time you want this information.

---
# Clustered Cont.
Let's understand when we need to use clustering in tables.

- Whenever we have a one-to-many relationship between the tables, then we opt for the clustered file organization method in DBMS. 
- In the above example, one course can be opted by many students; hence there is a 1:M relationship between the course and student table.
- Whenever we need to use the join operation very frequently for joining the tables of the database, then we may consider clustering those tables.

---
## Clustered file organization
### Pros:
- The cluster file organization is used when there is a frequent request for joining the tables with same joining condition.
- It provides the efficient result when there is a 1:M mapping between the tables.

### Cons:
- This method has the low performance for the very large database.
- If there is any change in joining condition, then this method cannot use. If we change the condition of joining then traversing the file takes a lot of time.
- This method is not suitable for a table with a 1:1 condition.
---
# Physical File Organization

## RAID

- Stands for *Redundant Array of Independent Disks*.
- Distributes data across multiple disks for redundancy and performance.
- Various RAID levels (0, 1, 5, 6, 10) offer different balances of performance and redundancy.

---
# Physical File Organization
### RAID 0 (Disk striping)

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 20px;">
  <div>
    <p>
RAID 0 splits data across any number of disks allowing higher data throughput.    </p>
    <p>
An individual file is read from multiple disks giving it access to the speed and capacity of all of them.    </p>
      This type of RAID does not provide redundancy or fault tolerance, meaning that if one disk fails, all data is lost. It's mainly used for tasks that require high-speed data access, such as video editing or high-performance computing.
  </div>
  
  <div style="text-align: right;">
    <img src="https://i.imgur.com/wdTDUMF.png" alt="Heap Organization" style="max-width: 100%; height: auto;">

</div>
  
</div>

---
## Physical File Organization
### RAID 1 (Disk Mirroring)

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 20px;">
  <div>
    <p>
RAID 1 writes and reads identical data to pairs of drives. This process is often called data mirroring and it’s a primary function is to provide redundancy   </p>
    <p>
If any of the disks in the array fails, the system can still access data from the remaining disk(s).    </p>
      RAID 1 is the easiest way to create failover storage.
  </div>
  
  <div style="text-align: right;">
    <img src="https://i.imgur.com/YTuVoOb.png" alt="Heap Organization" style="max-width: 100%; height: auto;">

</div>
  
</div>

---
## Physical File Organization
### RAID 10 (Striping + Mirroring)

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 20px;">
  <div>
    <p>
Combines the mirroring of RAID 1 with the striping of RAID 0. Or in other words, it combines the redundancy of RAID 1 with the increased performance of RAID 0.  </p>
    <p>
Pros: Very high performance. Fault tolerance.
   </p>
Cons: Lower usable capacity/High cost. Limited scalability
  </div>
  
  <div style="text-align: right;">
    <img src="https://i.imgur.com/S5jVmid.png" alt="Heap Organization" style="max-width: 100%; height: auto;">

</div>
  
</div>

---
## Physical File Organization
### RAID 5 (Striping with parity):

<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 20px;">
  <div>
    <p>
RAID 5 stripes data blocks across multiple disks like RAID 0, however, it also stores parity information (Small amount of data that can accurately describe larger amounts of data) which is used to recover the data in case of disk failure.</p>
    <p>
This level offers both speed (data is accessed from multiple disks) and redundancy as parity data is stored across all of the disks.    </p>
  </div>
  
  <div style="text-align: right;">
    <img src="https://i.imgur.com/nzkl7hH.png" alt="Heap Organization" style="max-width: 100%; height: auto;">

</div>
  
</div>

---
class: inverse, center, middle
# Logical File Organization
---
# Types of File Organization

## 2. Logical File Organization

Logical file organization deals with how data is logically structured and accessed by users and applications.
## Key Concepts

- **Data Models**: Hierarchical, Network, Relational, Object-Oriented
- **Schema**: Defines the logical structure of the database.
- **Views**: Customized presentations of data.
- **Normalization**: Process of reducing redundancy and improving data integrity.
- **Entities and Relationships**: Defines objects in the database and their relationships.

---

# Summary

- File organization is crucial for efficient data storage and retrieval.
- Physical file organization involves how data is physically stored, with various methods like sequential, heap, hashing, clustered, and RAID.
- Logical file organization focuses on the abstract structure and logical relationships of data, promoting data independence, integrity, and flexible access.

---
class: center,middle,inverse
# Thank You