---
title: "R Notebook"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

```{r}
options(warn=-1)
library(ggplot2)
library(viridis)
library(ggridges)

```


# Read dataset

```{r}
fish_data <- read.csv("fish_data.csv")
```


# Plot 1: Scatter Plot

I want to understand how the pH of the water relates to the life span of fish. The color of each point indicates the habitat type where the fish were observed. Because both the pH of the water and the life span are numerical variables, a scatter plot is a natural way to show their relationship. The plot shows a decreasing trend: as the pH of the water increases towards 8, the life span of fish tends to decrease.

```{r}
ggplot(fish_data, aes(x = ph_of_water, y = life_span, color = habitat)) +
  geom_point(size = 1, alpha = 0.7) +
  labs(title = "Scatter Plot",
       x = "pH of Water",
       y = "Life Span",
       color = "Habitat") +
  theme_minimal()
```
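
The decreasing trend described above is read off the plot by eye. A minimal sketch of one way to put a number on it, assuming a straight-line summary is reasonable, is to compute the Spearman rank correlation and the slope of a linear fit. The helper `trend_summary()` and the small `toy` data frame are hypothetical and only illustrate the calls; they are not part of the fish dataset.

```{r}
# Hypothetical helper: summarise the direction and strength of a trend.
trend_summary <- function(x, y) {
  fit <- lm(y ~ x)                              # simple linear fit of y on x
  c(
    spearman = cor(x, y, method = "spearman"),  # rank correlation in [-1, 1]
    slope    = unname(coef(fit)[2])             # change in y per unit of x
  )
}

# Toy values for illustration only (not taken from fish_data.csv).
toy <- data.frame(
  ph_of_water = c(6.0, 6.5, 7.0, 7.5, 8.0),
  life_span   = c(25, 22, 21, 15, 12)
)
toy_trend <- trend_summary(toy$ph_of_water, toy$life_span)
print(toy_trend)
```

On the notebook's data the call would be `trend_summary(fish_data$ph_of_water, fish_data$life_span)`; a negative slope and a negative correlation would support the trend read from the plot.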

# Plot 2: Histogram

In idle water, rivers, and slow-moving water, fish are most often observed at a pH of around 6.5 to 7. Ponds have the highest frequencies at pH 7 and from 7.5 to around 8. Lakes show the highest frequencies at pH 6.5, 7, and 7.5. This suggests that these habitats have water conditions suitable for the fish that live in them.


```{r}

# Create one histogram of ph_of_water per habitat (no density curve is drawn)
ggplot(fish_data, aes(x = ph_of_water, fill = habitat)) +
  geom_histogram(bins = 10, alpha = 0.7, position = "identity") +
  facet_wrap(~habitat, scales = "free") +
  labs(title = "Histogram of pH of Water by Habitat",
       x = "pH of Water",
       y = "Frequency",
       caption = "Source: Fish Data") +
  scale_fill_manual(values = c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3", "#a6d854")) +
  theme_minimal()

```
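
A distribution curve can be drawn on top of a histogram by putting the bars on the density scale and adding a kernel density layer. The sketch below shows one way to do this, assuming ggplot2 3.3.0 or later for `after_stat()`. The `toy` data frame is made up for illustration; in the notebook `fish_data` would take its place.

```{r}
library(ggplot2)

# Toy values for illustration only (not taken from fish_data.csv).
set.seed(1)
toy <- data.frame(
  ph_of_water = c(rnorm(50, mean = 6.8, sd = 0.4), rnorm(50, mean = 7.4, sd = 0.4)),
  habitat     = rep(c("lakes", "ponds"), each = 50)
)

p_density <- ggplot(toy, aes(x = ph_of_water, fill = habitat)) +
  # after_stat(density) rescales the bars so each panel's bars have total area 1
  geom_histogram(aes(y = after_stat(density)), bins = 10, alpha = 0.7) +
  # kernel density estimate drawn as an outline over the bars
  geom_density(fill = NA, color = "black") +
  facet_wrap(~habitat) +
  labs(x = "pH of Water", y = "Density") +
  theme_minimal()
p_density
```

Because the y-axis then shows density rather than frequency, the axis label changes accordingly.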


# Plot 3: Scatter plot with color-coded points

#### Analysis of Average Length and Average Weight by Habitat

Habitat variations: The points across different habitats show a similar distribution, and there are no evident clusters or groups that would suggest habitat-specific patterns in the relationship between average length and average weight.

Outliers: No individual data points stand out as deviating significantly from the overall trend.

Spread of points: The spread of points appears relatively consistent among habitats, indicating comparable variability in the relationship between average length and average weight across all observed environments.

```{r}
ggplot(fish_data, aes(x = average_length, y = average_weight, color = habitat)) +
  geom_point(size = 1, alpha = 0.5) +
  labs(title = "Scatter Plot",
       x = "Average Length",
       y = "Average Weight",
       color = "Habitat") +
  theme_minimal()
```
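
One way to check whether the length-weight relationship really looks the same in every habitat is to compute the correlation separately for each one. This is a sketch on made-up values; `cor_by_group()` is a hypothetical helper, not something defined elsewhere in the notebook.

```{r}
# Hypothetical helper: Pearson correlation of two columns within each group.
cor_by_group <- function(df, x, y, group) {
  sapply(split(df, df[[group]]), function(d) cor(d[[x]], d[[y]]))
}

# Toy values for illustration only (not taken from fish_data.csv).
toy <- data.frame(
  habitat        = rep(c("lakes", "rivers"), each = 4),
  average_length = c(10, 12, 14, 16, 10, 12, 14, 16),
  average_weight = c(8, 9, 10, 11, 11, 10, 9, 8)
)
toy_cor <- cor_by_group(toy, "average_length", "average_weight", "habitat")
print(toy_cor)
```

Similar values across habitats would be consistent with the description above; clearly different values, as in this toy example, would point to a habitat-specific pattern.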

# Plot 4: Bar plot of habitat counts

The bar plot shows how many observations fall into each habitat category. Lakes and ponds appear to be the most common, as shown by their taller bars. Idlewater is present but less common, as shown by its shorter bar.

```{r}
ggplot(fish_data, aes(x = habitat)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar Plot of Habitat Counts",
       x = "Habitat",
       y = "Count") +
  theme_minimal()
```
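
The exact counts behind the bars can be tabulated directly, which avoids judging bar heights by eye. A small sketch, using a made-up habitat vector in place of `fish_data$habitat`:

```{r}
# Toy values for illustration only (not taken from fish_data.csv).
toy_habitat <- c("lakes", "ponds", "lakes", "rivers", "ponds", "lakes", "idlewater")

# Count observations per habitat and order from most to least common.
habitat_counts <- sort(table(toy_habitat), decreasing = TRUE)
print(habitat_counts)

# Share of all observations in each habitat.
print(round(prop.table(habitat_counts), 2))
```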

# Plot 5: Boxplot with datapoints on them
The average weight of fish in the idlewater, lake, and river habitats is more than 10, whereas the average weight in ponds and slow-moving waters is about equal to 10.


Define a custom color palette
```{r}
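# Note: this palette is not referenced by the boxplot chunk that follows,
# which uses ggplot2's default color scale.
my_palette <- c("#440154", "#31688e", "#35b779", "#fde725")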
```

```{r}
ggplot(fish_data, aes(x = habitat, y = average_weight, color = habitat)) +
  geom_boxplot(aes(group = habitat), fill = "lightgray", alpha = 0.8, width = 0.5) +
  geom_jitter(position = position_jitter(width = 0.3), size = 0.5, alpha = 0.7) +
  stat_summary(
    fun = mean,
    geom = "point",
    shape = 3,
    size = 4,
    position = position_dodge(width = 0.5),
    color = "red"
  ) +
  labs(title = "Boxplot of Average Weight by Habitat",
       x = "Habitat",
       y = "Average Weight",
       color = "Habitat") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  theme(
    plot.title = element_text(size = 12, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 12),
    plot.background = element_rect(fill = "white"),
    panel.background = element_rect(fill = "white"),
    legend.background = element_rect(fill = "white"),
    panel.grid.major = element_line(color = "lightgray", linewidth = 0.2),
    panel.grid.minor = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.2),
    legend.key = element_rect(fill = "white"),
    legend.key.size = unit(0.5, "cm"),
    legend.key.width = unit(0.5, "cm"),
    legend.spacing.x = unit(0.2, "cm"),
    legend.spacing.y = unit(0.1, "cm"),
    legend.box.margin = margin(0, 0, 0, 0)
  ) 
```
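
In the boxplot, the red plus signs mark the per-habitat mean, while the line inside each box is the median. A short sketch of how both could be computed as numbers, using made-up values; in the notebook the same calls would be applied to `fish_data$average_weight` and `fish_data$habitat`.

```{r}
# Toy values for illustration only (not taken from fish_data.csv).
toy <- data.frame(
  habitat        = rep(c("lakes", "ponds"), each = 3),
  average_weight = c(10, 11, 15, 9, 10, 11)
)

# One row per habitat, with the mean and median of average_weight as columns.
weight_summary <- t(sapply(
  split(toy$average_weight, toy$habitat),
  function(w) c(mean = mean(w), median = median(w))
))
print(weight_summary)
```

A mean noticeably above the median, as for the toy `lakes` group, indicates a few heavy fish pulling the average up.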

# Plot 6: Ridge plot average weight by habitat
The peak of each ridge marks the most common average weight for that habitat. The peaks for slow-moving waters, lakes, and rivers are at almost 10, the peak for ponds is below 10, and the peak for idlewater is above 10.
Wider ridges indicate higher variability. Here the widths of all the ridges are almost the same, which suggests that the variability of average weight does not differ much between habitats.

Create a color palette for the plot
```{r}
my_palette <- c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3", "#a6d854")
```

```{r}
ggplot(fish_data, aes(x = average_weight, y = habitat, fill = habitat)) +
  geom_density_ridges(alpha = 0.5, rel_min_height = 0.01) +
  scale_fill_manual(values = my_palette) +
  labs(title = "Ridge Plot of Average Weight by Habitat",
       x = "Average Weight",
       y = "Habitat") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  theme(
    plot.title = element_text(size = 12, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 12),
    legend.title = element_blank(),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = "white"),
    panel.grid.major = element_line(color = "lightgray", linewidth = 0.2),
    panel.grid.minor = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.2)
  )
```
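
The peak position and width of each ridge can also be estimated numerically. The sketch below, on made-up values, takes the location of the maximum of a kernel density estimate as the peak and the standard deviation as the spread. `ridge_stats()` is a hypothetical helper, and using base R's `density()` with its default bandwidth is an assumption, so the results may differ slightly from what `geom_density_ridges()` draws.

```{r}
# Hypothetical helper: peak location and spread of one group's values.
ridge_stats <- function(v) {
  d <- density(v)                 # kernel density estimate (default bandwidth)
  c(peak = d$x[which.max(d$y)],   # x position where the density is highest
    sd   = sd(v))                 # standard deviation as a measure of width
}

# Toy values for illustration only (not taken from fish_data.csv).
set.seed(42)
toy <- data.frame(
  habitat        = rep(c("idlewater", "ponds"), each = 200),
  average_weight = c(rnorm(200, mean = 11, sd = 1), rnorm(200, mean = 9, sd = 1))
)
toy_ridges <- t(sapply(split(toy$average_weight, toy$habitat), ridge_stats))
print(round(toy_ridges, 2))
```

Similar `sd` values across habitats would match ridges of similar width, while the `peak` column gives the ordering of the ridge peaks.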

# Plot 7: Violin plot of lifespan by habitat

This violin plot does not draw an explicit median marker, so the median is estimated from the thickest region of each violin, where the life span values are most concentrated. It indicates roughly the middle value of the life span distribution for a specific habitat. Read this way, the median for idlewater is around 5; for lakes, ponds, and rivers it is around 20; and for slow-moving waters it is more than 20.

Create a color palette for the plot
```{r}
my_palette <- c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3", "#a6d854")

```

Create the violin plot
```{r}
ggplot(fish_data, aes(x = factor(habitat), y = life_span, fill = factor(habitat))) +
  geom_violin(trim = FALSE, scale = "width", width = 0.8, alpha = 0.5) +
  geom_jitter(width = 0.2, aes(color = factor(habitat)), size = 1, alpha = 0.5) +
  scale_fill_manual(values = my_palette) +
  scale_color_manual(values = my_palette) +
  theme_minimal() +
  labs(title = "Violin Plot of Life Span by Habitat",
       x = "Habitat",
       y = "Life Span") +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        legend.key = element_blank(),
        legend.background = element_blank()) +
  theme(plot.title = element_text(size = 12, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        legend.text = element_text(size = 12),
        panel.background = element_rect(fill = "white"),
        panel.grid.major = element_line(color = "lightgray", linewidth = 0.2),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "black", linewidth = 0.2)
  )
```
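
If the median should be visible in the plot itself rather than estimated by eye, one option is to overlay it as a point with `stat_summary()`. This is a sketch on made-up values, not the notebook's final figure.

```{r}
library(ggplot2)

# Toy values for illustration only (not taken from fish_data.csv).
toy <- data.frame(
  habitat   = rep(c("idlewater", "lakes"), each = 5),
  life_span = c(3, 4, 5, 6, 9, 15, 18, 20, 22, 30)
)

p_violin <- ggplot(toy, aes(x = habitat, y = life_span, fill = habitat)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  # draw one black point per habitat at the median life span
  stat_summary(fun = median, geom = "point", color = "black", size = 3) +
  labs(x = "Habitat", y = "Life Span") +
  theme_minimal()
p_violin
```

The same `stat_summary()` layer could be added to the violin plot of `fish_data`, alongside the jittered points.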
