M3 Project Report on R
ALY6000: Introduction to Analytics
Northeastern University
Professor: Dr. Dee Chiluiza, PhD

By: Zeeshan Ahmad Ansari

Date of Submission: 09 April, 2024


Library

#The report utilizes a set of libraries for various data processing and visualization tasks.

library(tidyverse)
library(readxl)
library(kableExtra)
library(dplyr)
library(knitr)
library(readr)
library(RColorBrewer)
library(magrittr)
library(FSA)
library(plotly)

#Dataset_Employed_in_this_M3_Report

M3Data = read_excel("inchBio.xlsx")


Introduction

The provided dataset contains information about different fish species, including their IDs, lengths, weights, and a boolean value indicating whether they have scales or not. It has 505 rows and 6 columns. With this dataset, we can perform several types of analysis to gain insight into the characteristics of these fish species. Potential analyses include:

  1. Species Description:

    We can identify and describe each fish species based on their IDs. This could include researching and presenting details about the habitat, behavior, and other ecological characteristics of each species.

  2. Descriptive Statistics:

    We can compute basic statistics for the lengths and weights of the fish, such as the mean, median, minimum, maximum, and standard deviation. This can provide us with a general idea of the average size and weight of each species.

  3. Size and Weight Distribution:

    We can use graphs such as histograms and box plots to depict the distribution of lengths and weights within the dataset. This helps us understand the size ranges of the various species.

  4. Scales and Species:

    We can count how many fish have scales and how many do not by calculating the scale frequency for each species. We can also compare the average length and weight between scaled and scale-less fish.

  5. Species Comparison:

    We can compare the characteristics of different species by calculating and comparing their mean lengths and weights, identifying which species tend to be larger or smaller.

  6. Data Visualization:

    We can create various types of charts and plots to visually represent the data, such as scatter plots of length vs. weight, bar charts of species frequency, or box plots to show the distribution of sizes for each species (Dee Chiluiza, 2022).

Reference:

  1. Dee Chiluiza. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/home
  2. Dee Chiluiza. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/796492
  3. Dee Chiluiza. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/scatterplot


Analysis

TASK_1
This task is divided into three parts.


Task_1A

In this task we used summary command to get information about the dataset and improved the presentation using kable and kable_styling commands.

data_summary <- summary(M3Data)

kable(data_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")
netID            fishID          species            length           weight              scale
Min.   :  4.00   Min.   :  7.0   Length:505         Min.   : 30.47   Min.   :   0.9027   Mode :logical
1st Qu.: 12.00   1st Qu.:169.0   Class :character   1st Qu.: 63.90   1st Qu.:   3.0711   FALSE:194
Median :101.00   Median :569.0   Mode  :character   Median :153.04   Median :  60.2374   TRUE :311
Mean   : 78.68   Mean   :487.5   NA                 Mean   :160.52   Mean   : 129.3169   NA
3rd Qu.:113.00   3rd Qu.:762.0   NA                 3rd Qu.:230.11   3rd Qu.: 193.4106   NA
Max.   :206.00   Max.   :915.0   NA                 Max.   :432.58   Max.   :1071.8813   NA

Observations to Task_1A:

The provided code employs the summary command to generate a concise summary of the dataset variables. The resulting table includes important statistics for each variable, such as minimum, maximum, median, mean, and quartile values. This facilitates a quick understanding of the range and distribution of numerical attributes.

The kable and kable_styling commands are utilized to enhance the presentation of the summary table. By applying formatting options, such as bold headers and striped rows, the table becomes more readable and aesthetically pleasing. The use of a slategrey background for the header row with white text enhances its visibility.

The summary table provides valuable insights into the dataset’s characteristics. For instance, the range of values for attributes like length and weight is evident from the minimum and maximum values. The quartile values offer information about the data distribution and spread, aiding in understanding the variability within the dataset.

The table showcases the data types and modes for each variable, helping users quickly identify whether a variable is numerical or character-based. This is particularly useful for determining the type of analysis that can be performed on each attribute.

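The NA entries in the species and scale columns do not indicate missing values. The character and logical columns have only three summary entries each, so the remaining cells of the table are padded with NA. Genuine missing values in a numeric column would be reported in an additional NA's row, which does not appear here.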

The scale variable, represented by the Mode entry logical, indicates that it contains binary data (likely indicating the presence or absence of scales). This provides a preliminary understanding of the nature of this categorical variable.

The mean and median values provide insights into the central tendency of numerical variables. Comparing these values can indicate whether the data distribution is skewed or symmetric.

The quartile values (1st, 2nd, and 3rd) allow for a deeper understanding of the data’s spread. They assist in identifying potential outliers and assessing the concentration of data within specific ranges.

By comparing the statistics for different variables, such as ‘length’ and ‘weight,’ it becomes possible to compare their central tendencies and ranges. This can aid in making informed decisions about data processing and analysis.

The summary table serves as a starting point for exploratory data analysis, helping analysts quickly grasp key features of the dataset. It highlights potential areas of interest for further investigation and guides subsequent analytical steps.
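One way to check whether NA cells in a summary table correspond to missing data is to count missing values directly. The sketch below uses a small made-up data frame (toy_fish is hypothetical and is not the inchBio data) in which one weight is deliberately missing.

```r
# Hypothetical toy data, not the inchBio dataset: one weight is really missing.
toy_fish <- data.frame(
  species = c("Bluegill", "Bluegill", "Iowa Darter", "Yellow Perch"),
  length  = c(145.6, 150.2, 49.4, 190.3),
  weight  = c(90.0, NA, 1.9, 107.5),
  stringsAsFactors = FALSE
)

# Count the actual missing values in every column.
missing_per_column <- colSums(is.na(toy_fish))
print(missing_per_column)

# For a numeric column, summary() reports real missing values in an "NA's" entry.
print(summary(toy_fish$weight))

# For a character column, summary() returns only Length, Class and Mode,
# so a combined table has fewer entries for that column than for numeric ones.
print(summary(toy_fish$species))
```

Running colSums(is.na(M3Data)) on the real data would be the equivalent check for this report.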

Task_1B
Usage of glimpse command

#Usage of glimpse command
glimpse(M3Data)
## Rows: 505
## Columns: 6
## $ netID   <dbl> 5, 16, 16, 21, 24, 24, 101, 101, 101, 101, 102, 102, 102, 102,…
## $ fishID  <dbl> 137, 208, 209, 218, 268, 269, 532, 534, 535, 537, 626, 627, 62…
## $ species <chr> "Black Crappie", "Black Crappie", "Black Crappie", "Black Crap…
## $ length  <dbl> 268.75365, 298.37579, 275.35851, 154.17039, 332.43289, 309.698…
## $ weight  <dbl> 276.278409, 380.552210, 260.684962, 47.293176, 580.883012, 441…
## $ scale   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…

Observations to TASK_1B:

The glimpse function in R is used to provide a concise and informative summary of the structure of a dataset. It is particularly helpful for quickly understanding the data types of variables, the number of observations (rows), and the first few values of each variable.

When we use the glimpse function, it displays the following information for each variable in the dataset:

  1. The name of the variable (column).
  2. The data type of the variable (e.g., numeric, character, factor, etc.).
  3. A preview of the first few values in the variable.

TASK_1C

Summary Function:

The built-in summary function in the R programming language serves as a valuable tool for generating concise overviews of data distribution within a dataset, particularly for numeric variables. By employing this function, users gain access to key statistical measures for each numeric column, facilitating a comprehensive understanding of the dataset’s characteristics. These measures encompass the minimum value, the value at the first quartile, the median (representing the second quartile), the arithmetic mean, the value at the third quartile, and the maximum value associated with each numeric variable.

For categorical variables stored as factors, the summary function shows the frequency of every distinct category, and for a logical variable such as scale it shows the counts of FALSE and TRUE. A character variable such as species is summarized only by its length, class, and mode, so it would need to be converted to a factor to obtain category counts. These counts are useful for understanding the prevalence of different categories within the dataset.

The significance of the summary function extends to its capacity for unveiling essential aspects of data distribution. This includes insights into the central tendency, providing a grasp of where the data tends to cluster, and the spread of numerical data, indicating the variability or dispersion within the dataset. Additionally, for categorical data, the function illuminates the distribution of different categories, enabling an understanding of the proportional representation of each category.

Advantages:

  1. Provides a quick overview of numeric and categorical variables.

  2. Displays common statistics, which helps in understanding the distribution of the data.

  3. Compact and easy to interpret.

Limitations:

  1. Gives informative statistics only for numeric, factor, and logical columns; character columns are reduced to length, class, and mode.

  2. Doesn’t show individual data points, making it hard to spot outliers or unusual values.

  3. Limited information on data types and column names.

Glimpse Function:

Provided by the dplyr package, the glimpse function gives a more structural view of an entire dataset. It reports the number of rows and columns, the name and data type of each column, and a preview of the first values in each column.

The glimpse function therefore gives a rapid overview: the data types that define the structure of the dataset, a look at actual content through the first observations, and a convenient reference for the column names.

Appreciating the significance of this function extends to its facilitation of efficient preliminary data assessment. Its role in delivering a swift overview of the dataset’s data types, column nomenclature, and initial data points streamlines the exploratory process. This proves indispensable for data analysts and researchers aiming to promptly gauge the dataset’s characteristics and commence their investigation with a solid foundation of knowledge.

In essence, the glimpse function emerges as a powerful instrument within the dplyr package, enabling users to transcend the superficial and embrace a comprehensive understanding of their dataset. Its provision of essential metadata and a preview of actual data values empowers users to embark on their data analysis journeys armed with critical insights.

Advantages:

  1. Displays data types, column names, and the first few rows of the dataset, providing a comprehensive overview.

  2. Helps identify the number of rows and columns in the dataset.

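  3. Makes unexpected data types easy to spot (for example, a numeric column read as character); missing values are visible only if they occur among the first values shown.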

Limitations:

  1. Does not provide statistical summaries like summary (e.g., mean, median, quartiles).

  2. Shows only as many leading values of each column as fit on one line of the console, which may not be representative of a large dataset.

  3. Not as concise as summary for numeric summaries.

In short, summary is useful for providing a quick statistical overview of the variables, while glimpse is valuable for understanding the overall structure of the dataset, including data types, column names, and the first values. Both functions have their advantages, and the choice between them depends on the specific information you need about the dataset. In practice, it’s often beneficial to use both functions together to gain a comprehensive understanding of the data.
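The difference in how summary treats text columns can be illustrated with a short base R sketch. The vector species_chr below is a hypothetical example, not a column from the dataset, and str() is shown only as a base R function with a similar purpose to glimpse.

```r
# Hypothetical species vector, used only for illustration.
species_chr <- c("Bluegill", "Bluegill", "Iowa Darter", "Yellow Perch")

# For a character vector, summary() reports Length, Class and Mode.
print(summary(species_chr))

# After conversion to a factor, summary() reports a count per level.
species_counts <- summary(factor(species_chr))
print(species_counts)

# str() is a base R way to see column types and the first values.
str(data.frame(species = species_chr,
               length = c(145.6, 150.2, 49.4, 190.3),
               stringsAsFactors = FALSE))
```

Converting species to a factor before calling summary would therefore be one option if per-species counts are wanted in the summary table.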

TASK_2
Incorporation of inline R code within the text

# Column number
num_columns <- ncol(M3Data)

# Row number
num_rows <- nrow(M3Data)

# Using inline R code to display rows and columns
cat("Number of columns:", num_columns, "\nNumber of rows:", num_rows)
## Number of columns: 6 
## Number of rows: 505

Observations:

The above code shows how we used R code to display the current number of columns (6) and rows (505) in the dataset.
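In R Markdown, inline code is written inside the prose itself rather than in a chunk. The sketch below shows the inline form as a comment and builds the same sentence with sprintf(); demo_data is a placeholder with the reported dimensions, not the real data.

```r
# Stand-in for M3Data with the dimensions reported above (505 rows, 6 columns);
# the contents are placeholders, not the real inchBio values.
demo_data <- as.data.frame(matrix(0, nrow = 505, ncol = 6))

num_rows <- nrow(demo_data)
num_columns <- ncol(demo_data)

# In the .Rmd prose (outside a chunk), the inline form would be written as:
#   The dataset has `r num_rows` rows and `r num_columns` columns.
# The same sentence can be assembled inside a chunk with sprintf():
sentence <- sprintf("The dataset has %d rows and %d columns.", num_rows, num_columns)
print(sentence)
```

With the inline form, the numbers in the knitted text update automatically whenever the data change.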

TASK_3
Performing data analysis on variables

# Choose the species, length, and weight columns, and extract the initial five as well as the final five records from the data.

inchBio_subset <- M3Data %>%
  select(species, length, weight) %>%
  headtail(n = 5)

kable(inchBio_subset, digits = 2 , format = "html", align = "l") %>%
row_spec(0, background = "slategrey", bold = TRUE, color = "white")%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14)
     species        length  weight
1    Black Crappie  268.75  276.28
2    Black Crappie  298.38  380.55
3    Black Crappie  275.36  260.68
4    Black Crappie  154.17  47.29
5    Black Crappie  332.43  580.88
501  Yellow Perch   223.26  114.20
502  Yellow Perch   86.37   5.87
503  Yellow Perch   93.06   8.16
504  Yellow Perch   82.16   5.82
505  Yellow Perch   72.45   3.04


Observations:

In this task we used the pipe operator (%>%) from the magrittr package to select three specific columns: species, length, and weight. This narrowed the analysis to the variables of interest. We then used the headtail function (from the FSA library) to show the first five and last five records of this subset. Finally, we presented the result as a formatted table using kable and kable_styling, which makes the information in the dataset easier to read and present.
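The same "first five and last five" view can be approximated in base R when the FSA package is not available. In the sketch below, first_last() is a hypothetical helper written for this example (it is not part of FSA), and the built-in iris data stands in for the fish data.

```r
# Base R sketch of "first n and last n rows"; first_last() is a hypothetical
# helper written for this example, not a function from the FSA package.
first_last <- function(df, n = 5) {
  rbind(head(df, n), tail(df, n))
}

# The built-in iris data stands in for the fish data here.
iris_subset <- first_last(iris[, c("Species", "Sepal.Length", "Sepal.Width")], n = 5)
print(iris_subset)
```

The original row numbers are kept as row names, which is why the output jumps from row 5 to row 146.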

TASK_4
Conducting an analysis to describe the statistical characteristics of the length and weight variables.

# Conducting an analysis to describe the statistical characteristics of the length and weight variables.
M3mean_l <- mean(M3Data$length)
M3median_l <- median(M3Data$length)
M3sd_l <- sd(M3Data$length)
M3quantiles_l <- quantile(M3Data$length, probs = c(0.25, 0.5, 0.75))

M3mean_wt <- mean(M3Data$weight)
M3median_wt <- median(M3Data$weight)
M3sd_wt <- sd(M3Data$weight)
M3quantiles_wt <- quantile(M3Data$weight, probs = c(0.25, 0.5, 0.75))

# Make vectors for the names of the columns and rows.

columns <- c("Mean", 
             "Median", 
             "Std Dev", 
             "25th_quant", 
             "50th_quant (Median)", 
             "75th_quant")

rows <- c("Length", 
          "Weight")

# Matrix to store data

stats_matrix <- matrix(
  c(M3mean_l, 
    M3median_l, 
    M3sd_l, 
    M3quantiles_l, 
    M3mean_wt, 
    M3median_wt, 
    M3sd_wt, 
    M3quantiles_wt),
  nrow = 2,
  byrow = TRUE,  # fill row by row so each variable's statistics stay on one row
  dimnames = list(rows, columns)
)


kable(stats_matrix, digits = 2, format = "html", align = "l")%>%
  row_spec(0 , background = "slategrey", bold = TRUE, color = "white")%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14)
        Mean    Median  Std Dev  25th_quant  50th_quant (Median)  75th_quant
Length  160.52  153.04  100.29   63.90       153.04               230.11
Weight  129.32  60.24   167.57   3.07        60.24                193.41


Observations:

The code above computes basic descriptive statistics for the “length” and “weight” variables in the “M3Data” dataset. The derived statistics, including mean, median, standard deviation, and selected quantiles, are then arranged in a matrix, filled row by row so that each row holds one variable, and presented using kable and kable_styling.

  1. The mean “Length” (160.52) is slightly larger than the median (153.04), suggesting a mildly positively skewed distribution with some longer fish pulling the mean higher.

  2. The “Weight” distribution is highly skewed, as indicated by the large difference between the mean and median. The median of 60.24 is less than half the mean of 129.32.

  3. In both variables, the standard deviation (100.29 for length, 167.57 for weight) indicates substantial variability around the mean; for weight it even exceeds the mean.

  4. The 25th percentile for “Length” (63.90) indicates that 25% of the observations have lengths lower than this value, while the 75th percentile (230.11) indicates that 75% have lengths lower than this value.

  5. For “Weight,” the 25th percentile (3.07) is far below the median (60.24), indicating that at least a quarter of the fish are very light, while the 75th percentile (193.41) shows that the heaviest quarter weighs considerably more.
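Because matrix() fills its values column by column unless byrow = TRUE is set, the fill order matters when a vector lists all of one variable's statistics followed by all of another's. The sketch below contrasts the two fill orders using made-up numbers, not statistics from the dataset.

```r
# Illustrative numbers only; they are not statistics from the dataset.
length_stats <- c(10, 20, 30)     # intended first row
weight_stats <- c(100, 200, 300)  # intended second row
stat_names <- c("Mean", "Median", "Std Dev")

# Default fill is column by column, so the two variables get interleaved.
by_column <- matrix(c(length_stats, weight_stats), nrow = 2,
                    dimnames = list(c("Length", "Weight"), stat_names))

# byrow = TRUE keeps each variable's statistics together on its own row.
by_row <- matrix(c(length_stats, weight_stats), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Length", "Weight"), stat_names))

print(by_column)
print(by_row)
```

An alternative with the same effect would be rbind(length_stats, weight_stats), which stacks each vector as a row without relying on fill order.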

TASK_5
Displaying mean length and weight of 7 Species in matrix format

mean_data <- M3Data %>%
  group_by(species) %>%
  summarise(Mean_length = round(mean(length), 2), Mean_Wt = round(mean(weight), 2))


matrix_table <- mean_data %>%
  kable(format = "html", caption = "Species average length and weight", align = "l") %>%
row_spec(0, background = "slategrey", bold = TRUE, color = "white")%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14)


matrix_table
Species average length and weight
species           Mean_length  Mean_Wt
Black Crappie     276.08       360.35
Bluegill          145.62       90.03
Bluntnose Minnow  64.19        3.03
Iowa Darter       49.43        1.88
Largemouth Bass   299.28       353.69
Pumpkinseed       135.08       99.44
Yellow Perch      190.29       107.49


Observations:

The above code calculates the mean length and mean weight for each of the seven species in the inchBio dataset. The calculated means are rounded to two decimal places. The results are then presented in a table. Here are some observations about the code and the obtained data:

  1. The table provides a clear overview of the mean length and mean weight for each of the seven species. This presentation makes it easy to compare and contrast the characteristics of different species based on these two variables.

  2. The data reveals substantial variation in mean length and mean weight among the species. For example, Largemouth Bass has the highest mean length (299.28) and Black Crappie the highest mean weight (360.35), while Iowa Darter has the lowest mean length (49.43) and mean weight (1.88).

  3. The table highlights the considerable size differences among the species. Bluntnose Minnow and Iowa Darter are by far the smallest; Pumpkinseed, Bluegill, and Yellow Perch are intermediate; and Black Crappie and Largemouth Bass are the largest on average.

  4. These mean values provide insights into the ecological characteristics of each species. For instance, the relatively small size of Iowa Darter and Bluntnose Minnow could suggest adaptations to specific habitats and ecological niches.
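The extremes in a table like this can be read off programmatically rather than by eye. The sketch below re-enters the species means from the table above into a data frame (species_means is a name chosen for this example) and uses which.max(), which.min(), and order() in base R.

```r
# Species means copied from the table above.
species_means <- data.frame(
  species = c("Black Crappie", "Bluegill", "Bluntnose Minnow", "Iowa Darter",
              "Largemouth Bass", "Pumpkinseed", "Yellow Perch"),
  mean_length = c(276.08, 145.62, 64.19, 49.43, 299.28, 135.08, 190.29),
  mean_weight = c(360.35, 90.03, 3.03, 1.88, 353.69, 99.44, 107.49),
  stringsAsFactors = FALSE
)

# Species with the largest and smallest means.
longest  <- species_means$species[which.max(species_means$mean_length)]
heaviest <- species_means$species[which.max(species_means$mean_weight)]
shortest <- species_means$species[which.min(species_means$mean_length)]

# Sort from longest to shortest for a ranked view.
ranked <- species_means[order(-species_means$mean_length), ]
print(ranked)
cat("Longest:", longest, "| Heaviest:", heaviest, "| Shortest:", shortest, "\n")
```

In the report itself, the same calls could be applied directly to the mean_data object instead of re-typed values.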

TASK_6
The objective is to generate a table that displays the occurrences, accumulated occurrences, likelihood, and accumulated likelihood of the species variable.

species_freq_table <- table(M3Data$species)


freq_df <- data.frame(Species = names(species_freq_table), 
                      Frequency = as.vector(species_freq_table))



# Compute the accumulated frequencies, likelihoods, and cumulative likelihoods.
total_species <- sum(freq_df$Frequency)
freq_df <- freq_df %>% 
  mutate(Cumulative_Frequency = cumsum(Frequency),
         Probability = Frequency / total_species,
         Cumulative_Probability = cumsum(Probability))


kable(freq_df, format = "html", digits = 2, align = "l") %>%
row_spec(0, background = "slategrey", bold = TRUE, color = "white")%>%
kable_styling(full_width = TRUE, "striped",font_size = 14)
Species           Frequency  Cumulative_Frequency  Probability  Cumulative_Probability
Black Crappie     25         25                    0.05         0.05
Bluegill          208        233                   0.41         0.46
Bluntnose Minnow  100        333                   0.20         0.66
Iowa Darter       31         364                   0.06         0.72
Largemouth Bass   90         454                   0.18         0.90
Pumpkinseed       13         467                   0.03         0.92
Yellow Perch      38         505                   0.08         1.00

Observations:

In the above code, we created a frequency table for the species variable in the “M3Data” (inchBio) dataset, converted this frequency table into a data frame, and calculated cumulative frequencies, probabilities, and cumulative probabilities. The results are presented in a tabular format. Here are observations about the code and the obtained data:

  1. The frequency table presents the number of occurrences for each species in the dataset. This information provides a clear count of how many observations are associated with each species.

  2. The cumulative frequency represents the total of frequencies as we move through the species in the table. For instance, the cumulative frequency for Bluegill is 233, which is the sum of its frequency and the frequencies of the previous species.

  3. The probability column indicates the likelihood of encountering each species, calculated by dividing the frequency by the total number of observations (505). Cumulative probability shows the increasing likelihood as we move through the species in the table.

  4. The cumulative probability for “Bluntnose Minnow” (0.66) indicates that the cumulative likelihood of encountering species up to Bluntnose Minnow is 66%. The cumulative probability for Yellow Perch is 1.00 because it is the last species in the table, so all observations have been accumulated.

  5. The frequencies vary widely among species, with Bluegill being the most frequently observed species (208 occurrences), while Pumpkinseed (13) and Black Crappie (25) are the least commonly observed.
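The arithmetic behind the table can be reproduced from the species counts alone. The sketch below re-enters the frequencies shown above as a named vector (counts and total_fish are names chosen for this example) and recomputes the probability columns in base R.

```r
# Species counts copied from the frequency table above.
counts <- c("Black Crappie" = 25, "Bluegill" = 208, "Bluntnose Minnow" = 100,
            "Iowa Darter" = 31, "Largemouth Bass" = 90, "Pumpkinseed" = 13,
            "Yellow Perch" = 38)

total_fish <- sum(counts)                      # total observations, not number of species
probability <- counts / total_fish             # relative frequency of each species
cumulative_probability <- cumsum(probability)  # running total, ends at 1

print(round(cbind(probability, cumulative_probability), 2))
```

The denominator is the number of fish observed, which is why the probabilities add up to 1 across the seven species.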

TASK_7
Producing a pie chart to depict probabilities and a bar plot to illustrate cumulative probabilities.

# Producing a pie chart to depict probabilities
pie_chart <- plot_ly(
  freq_df,
  labels = ~Species,
  values = ~Probability,
  type = "pie",
  textinfo = "label+percent",
  title = list(
    text = "Probability of Each Species",
    font = list(size = 20, weight = "bold"))
)

subplot(pie_chart)
# Define custom colors for each species
custom_colors <- c("#43a2ca", "slategrey", "red", "green", "lightblue", "orange", "yellow")

# Producing a bar plot to illustrate cumulative probabilities
bar_plot <- barplot(
  freq_df$Cumulative_Probability,
  names.arg = freq_df$Species,
  main = "Accumulated Likelihood for Each Species",
  xlab = "Species",
  ylab = "Cumulative Probability",
  ylim = c(0, 1.1),  # headroom so the label above the tallest bar is not clipped
  cex.names = 0.5,
  col = custom_colors)

# Display the values on top of the bars
text(bar_plot, freq_df$Cumulative_Probability, labels = sprintf("%.2f", freq_df$Cumulative_Probability), pos = 3)


Observations:

Observations from the Pie Chart:

  1. The pie chart provides a visual representation of the probability distribution of each species. Each slice of the pie corresponds to a species, and its size reflects the proportion of the total probability it represents.

  2. The species Bluegill has the largest slice, indicating that it has the highest probability among the species.

  3. Bluntnose Minnow and Largemouth Bass have similar probabilities, while the other species have lower probabilities.

  4. Each slice is labelled with the species name and its percentage (set through textinfo = "label+percent"), and the legend lists the species.

Observations from the Bar Plot:

  1. The bar plot illustrates the cumulative probabilities of each species. The x-axis represents the species, while the y-axis represents the cumulative probability. The bars are colored differently for each species, aiding easy visual differentiation.

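  2. Yellow Perch has the highest cumulative probability (1.00), followed by Pumpkinseed (0.92) and Largemouth Bass (0.90). This ordering reflects the alphabetical order of the species in the table rather than their abundance, because cumulative probabilities can only increase from one bar to the next.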

  3. The values on top of each bar show the cumulative probability up to two decimal places, aiding precise interpretation.

Both visualizations effectively convey insights about the distribution of probabilities and cumulative probabilities among the different species in the dataset. The pie chart provides a clear overview of the probability distribution, while the bar plot highlights the cumulative probabilities and enables comparison between species.
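Cumulative probabilities depend on the order in which the species are accumulated. As an illustration of an alternative ordering (not the one used in this report), the sketch below sorts the species from most to least frequent before accumulating, using the counts from the frequency table.

```r
# Species counts copied from the frequency table in TASK_6.
counts <- c("Black Crappie" = 25, "Bluegill" = 208, "Bluntnose Minnow" = 100,
            "Iowa Darter" = 31, "Largemouth Bass" = 90, "Pumpkinseed" = 13,
            "Yellow Perch" = 38)

# Sort from most to least frequent, then accumulate the shares.
sorted_counts <- sort(counts, decreasing = TRUE)
cumulative_share <- cumsum(sorted_counts) / sum(sorted_counts)

print(round(cumulative_share, 2))
```

In this ordering the three most common species (Bluegill, Bluntnose Minnow, and Largemouth Bass) account for about 79% of the fish, and the resulting vector could be passed to barplot() in the same way as the cumulative probabilities above.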


CONCLUSION

In this report, we embarked on a comprehensive exploration of a dataset containing information about various fish species. By leveraging a range of libraries and tools in R, we delved into different aspects of data processing, analysis, and visualization to uncover insights into the characteristics and distribution of these species.

The initial steps of our analysis involved gaining an understanding of the dataset’s structure and dimensions. We utilized libraries such as tidyverse, readxl, and kableExtra to enhance data presentation. Through the summary function, we obtained key statistical measures for the numeric variables, shedding light on central tendencies and variabilities. Additionally, the glimpse function enabled us to obtain a detailed overview of the dataset’s structure, including data types and initial observations.

Subsequently, we directed our attention to specific columns, namely species, length, and weight, and selectively extracted and presented the first and last five records using the headtail function. This approach allowed us to focus on a subset of the data for closer examination.

In the pursuit of deeper insights, we calculated descriptive statistics for the length and weight variables. The resulting table provided mean, median, standard deviation, and quantiles, enabling a comprehensive grasp of the distribution and variability within these attributes.

Species-based analysis unveiled intriguing patterns. By calculating mean lengths and weights for each species, we were able to highlight distinct size variations among the different fish. This information provides a glimpse into the ecological characteristics and adaptations of each species.

A significant portion of our analysis centered around the construction of a frequency table for the species variable. This table, along with cumulative frequencies, probabilities, and cumulative probabilities, facilitated a holistic understanding of species occurrences and their likelihood. Our visualizations, including a pie chart displaying probabilities and a bar plot illustrating cumulative probabilities, vividly showcased the distribution and relative significance of each species.

In summation, this report showcases the power of R as a versatile tool for data analysis and visualization. Through libraries such as dplyr, plotly, and others, we navigated through data exploration, statistical analysis, and graphical representation. The insights gained from this analysis contribute to our understanding of fish species characteristics, their distributions, and the ecosystem they inhabit. This report serves as a testament to the utility of R in uncovering valuable insights from datasets, furthering our knowledge in ecological and biological domains.


BIBLIOGRAPHY

  1. Dee Chiluiza. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/home
  2. Dee Chiluiza. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/796492
  3. Dee Chiluiza. (2022, June 25). RPubs. https://rpubs.com/Dee_Chiluiza/scatterplot
  4. DataDaft. (2019, October 31). dplyr: filter [Video]. YouTube. https://www.youtube.com/watch?v=BkmYBBM2SdQ
  5. Hao Zhu. (2021, February 19). https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#Table_Styles
  6. Plotly Technologies Inc. (2022). Plotly: Create Interactive Web Graphics via ‘plotly.js’. R package version 4.10.2. https://CRAN.R-project.org/package=plotly
  7. Chiluiza, D. (n.d.). RPubs: Pie Charts. Retrieved from https://rpubs.com/Dee_Chiluiza/995745
  8. Wickham, H., & François, R. (2021). dplyr: A grammar of data manipulation. R package version 1.0.7. https://CRAN.R-project.org/package=dplyr
  9. Boone, E. (2017, October 2). Tidyverse in R… Select and Group By [Video]. YouTube. https://www.youtube.com/watch?v=timZ6erM7Z4


Appendix

This report is accompanied by an R Markdown file named Ansari_ALY6000Project_M3.Rmd.