M3 Project Report
on R
ALY6000:Introduction to Analytics
Northeastern University
Professor: Dr. Dee Chiluiza, PhD
By: Zeeshan Ahmad Ansari
Date of Submission: 09 April, 2024
Library
#The report utilizes a set of libraries for various data processing and visualization tasks.
library(tidyverse)
library(readxl)
library(kableExtra)
library(dplyr)
library(knitr)
library(readr)
library(RColorBrewer)
library(magrittr)
library(FSA)
library(plotly)
#Dataset_Employed_in_this_M3_Report
M3Data = read_excel("inchBio.xlsx")
Introduction
The provided dataset appears to contain information about different fish
species, including their IDs, lengths, weights, and a boolean value
indicating whether they have scales or not. It contains 505 rows and
6 columns. With
this dataset, we can perform various types of analysis to gain insights
into the characteristics of these fish species. Here are some potential
analyses and information we can derive from this dataset:
We can identify and describe each fish species based on their IDs. This could include researching and presenting details about the habitat, behavior, and other ecological characteristics of each species.
We can compute basic statistics for the lengths and weights of the fish, such as the mean, median, minimum, maximum, and standard deviation. This can provide us with a general idea of the average size and weight of each species.
We can use graphs such as histograms and pie charts to depict the distribution of lengths and weights within the dataset. This can help us understand the size ranges of the various species.
We can analyze how many fish have scales and how many do not by calculating the scale frequency for each species. We can also compare the average length and weight between scaled and scale-less fish.
We can compare the characteristics of different species by calculating and comparing the mean lengths and weights, identifying which species tend to be larger or smaller.
We can create various types of charts and plots to visually represent
the data, such as scatter plots of length vs. weight, bar charts of
species frequency, or box plots to show the distribution of sizes for
each species (Dee Chiluiza, 2022).
Reference:
Analysis
TASK_1
This task is divided into three
parts.
Task_1A
In this task we used the summary command to get information
about the dataset and improved the presentation using the kable
and kable_styling commands.
data_summary <- summary(M3Data)
kable(data_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE) %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey", color = "white")
| netID | fishID | species | length | weight | scale |
|---|---|---|---|---|---|
| Min. : 4.00 | Min. : 7.0 | Length:505 | Min. : 30.47 | Min. : 0.9027 | Mode :logical |
| 1st Qu.: 12.00 | 1st Qu.:169.0 | Class :character | 1st Qu.: 63.90 | 1st Qu.: 3.0711 | FALSE:194 |
| Median :101.00 | Median :569.0 | Mode :character | Median :153.04 | Median : 60.2374 | TRUE :311 |
| Mean : 78.68 | Mean :487.5 | NA | Mean :160.52 | Mean : 129.3169 | NA |
| 3rd Qu.:113.00 | 3rd Qu.:762.0 | NA | 3rd Qu.:230.11 | 3rd Qu.: 193.4106 | NA |
| Max. :206.00 | Max. :915.0 | NA | Max. :432.58 | Max. :1071.8813 | NA |
Observations to Task_1A:
The provided code employs the summary command to generate a
concise summary of the dataset variables. The resulting table includes
important statistics for each variable, such as minimum,
maximum, median, mean, and
quartile values. This facilitates a quick understanding of
the range and distribution of numerical attributes.
The kable and kable_styling commands are
utilized to enhance the presentation of the summary table. By applying
formatting options, such as bold headers and striped rows, the table
becomes more readable and aesthetically pleasing. The use of a slategrey
background for the header row with white text enhances its visibility.
The summary table provides valuable insights into the dataset’s
characteristics. For instance, the range of values for attributes like
length and weight is evident from the minimum
and maximum values. The quartile values offer information about the data
distribution and spread, aiding in understanding the variability within
the dataset.
For the character and logical variables, the table shows the class and mode instead of numeric statistics, helping users quickly identify whether a variable is numerical or character-based. This is particularly useful for determining the type of analysis that can be performed on each attribute.
The NA entries in the table do not indicate missing values. They are
padding: the character and logical columns have fewer summary rows than
the numeric columns, so the remaining cells are left empty. When a
numeric or logical column contains missing values, summary reports them
in a separate NA's entry, and no such entry appears here.
The scale variable, represented by the Mode
entry logical, indicates that it contains binary data
(likely indicating the presence or absence of scales). This provides a
preliminary understanding of the nature of this categorical variable.
The mean and median values provide insights into the central tendency of numerical variables. Comparing these values can indicate whether the data distribution is skewed or symmetric.
The quartile values (1st, 2nd, and 3rd) allow for a deeper understanding of the data’s spread. They assist in identifying potential outliers and assessing the concentration of data within specific ranges.
By comparing the statistics for different variables, such as ‘length’ and ‘weight,’ it becomes possible to compare their central tendencies and ranges. This can aid in making informed decisions about data processing and analysis.
The summary table serves as a starting point for exploratory data analysis, helping analysts quickly grasp key features of the dataset. It highlights potential areas of interest for further investigation and guides subsequent analytical steps.
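To check directly whether a column contains missing values, rather than reading it off the printed summary, one can count the NA entries per column. The following is a minimal sketch in base R; `toy` is a hypothetical data frame with made-up values, not the inchBio data.

```r
# Hypothetical toy data; the values are illustrative, not taken from inchBio.
toy <- data.frame(
  species = c("Bluegill", "Bluegill", "Yellow Perch", NA),
  length  = c(145.2, NA, 190.3, 72.4),
  scale   = c(TRUE, FALSE, TRUE, TRUE)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Count the rows that have no missing value in any column.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Applied to this report, the same check would be `colSums(is.na(M3Data))`.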
Task_1B
Usage of glimpse command
#Usage of glimpse command
glimpse(M3Data)
## Rows: 505
## Columns: 6
## $ netID <dbl> 5, 16, 16, 21, 24, 24, 101, 101, 101, 101, 102, 102, 102, 102,…
## $ fishID <dbl> 137, 208, 209, 218, 268, 269, 532, 534, 535, 537, 626, 627, 62…
## $ species <chr> "Black Crappie", "Black Crappie", "Black Crappie", "Black Crap…
## $ length <dbl> 268.75365, 298.37579, 275.35851, 154.17039, 332.43289, 309.698…
## $ weight <dbl> 276.278409, 380.552210, 260.684962, 47.293176, 580.883012, 441…
## $ scale <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
Observations to TASK_1B:
The glimpse function in R is used to provide a concise and informative summary of the structure of a dataset. It is particularly helpful for quickly understanding the data types of variables, the number of observations (rows), and the first few values of each variable.
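When we use the glimpse function, it first prints the number of rows and columns and then displays the following information for each variable in the dataset: the column name, the data type (here <dbl>, <chr>, or <lgl>), and the first few values.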
TASK_1C
Summary Function:
The built-in summary function in the R programming language
serves as a valuable tool for generating concise overviews of data
distribution within a dataset, particularly for numeric variables. By
employing this function, users gain access to key statistical measures
for each numeric column, facilitating a comprehensive understanding of
the dataset’s characteristics. These measures encompass the minimum
value, the value at the first quartile, the median (representing the
second quartile), the arithmetic mean, the value at the third quartile,
and the maximum value associated with each numeric variable.
For categorical variables stored as factors, the summary function shows the frequency of each distinct category. Character columns, such as species in this dataset, are reported only by length, class, and mode, so they must be converted to factors (or tabulated with table) before the category counts become visible. Those counts are essential for understanding the structure of categorical data, since they show how prevalent each category is within the dataset.
The significance of the summary function extends to its capacity for unveiling essential aspects of data distribution. This includes insights into the central tendency, providing a grasp of where the data tends to cluster, and the spread of numerical data, indicating the variability or dispersion within the dataset. Additionally, for categorical data, the function illuminates the distribution of different categories, enabling an understanding of the proportional representation of each category.
Advantages:
Provides a quick overview of numeric and categorical variables.
Displays common statistics, which helps in understanding the distribution of the data.
Compact and easy to interpret.
Limitations:
Most informative for numeric and factor columns; character columns receive only length, class, and mode.
Doesn’t show individual data points, making it hard to spot outliers or unusual values.
Limited information on data types and column names.
Glimpse Function:
Available through the dplyr package, the glimpse function
gives a compact view of the structure of an entire dataset. It reports
the number of rows and columns, the name and data type of each column,
and a preview of the first values in each column.
This makes glimpse well suited to a first assessment of a dataset. Data types, column names, and leading values are visible at once, so an analyst can quickly confirm that the data were read as intended before starting the analysis.
In short, glimpse supplies the structural metadata and a preview of actual data values that the statistical output of summary does not show.
Advantages:
Displays data types, column names, and the first few values of each column, providing a comprehensive overview.
Helps identify the number of rows and columns in the dataset.
Makes unexpected data types easy to spot (for example, a numeric column that was read as character).
Limitations:
Does not provide statistical summaries like summary (e.g., mean, median, quartiles).
Shows only as many leading values per column as fit the width of the console, which may not be representative of a large dataset.
Not as concise as summary for numeric summaries.
To conclude, summary is useful for providing a quick statistical overview of the variables, while glimpse is valuable for understanding the overall structure of the dataset, including data types, column names, and the initial values. Both functions have their advantages, and the choice between them depends on the specific information you need about the dataset. In practice, it’s often beneficial to use both functions together to gain a comprehensive understanding of the data.
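The difference in how summary treats character and factor data can be seen with a few lines of base R. This is a minimal sketch on hypothetical species labels, not the report's data.

```r
# Hypothetical species labels, stored first as character and then as a factor.
species_chr <- c("Bluegill", "Bluegill", "Yellow Perch", "Iowa Darter", "Bluegill")
species_fct <- factor(species_chr)

# For a character vector, summary() reports only length, class, and mode.
chr_summary <- summary(species_chr)
print(chr_summary)

# For a factor, summary() reports the count of each level.
fct_summary <- summary(species_fct)
print(fct_summary)

# table() gives the same counts without converting the vector first.
chr_counts <- table(species_chr)
print(chr_counts)
```

For the species column of M3Data, either `summary(factor(M3Data$species))` or `table(M3Data$species)` would give per-species counts.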
TASK_2
Incorporation of inline R code within
the text
# Column number
num_columns <- ncol(M3Data)
# Row number
num_rows <- nrow(M3Data)
# Display the number of columns and rows
cat("Number of columns:", num_columns, "\nNumber of rows:", num_rows)
## Number of columns: 6
## Number of rows: 505
Observations:
The above code shows how we used R code to retrieve and display the
current number of columns (6) and rows (505) in the dataset.
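In R Markdown, inline code is written inside a sentence as a backtick, the letter r, an expression, and a closing backtick; the expression is evaluated when the document is knit. The sketch below builds the same kind of sentence in an ordinary chunk so that it can be run outside knitr. The data frame `toy` is a hypothetical stand-in for M3Data.

```r
# Hypothetical stand-in for M3Data: 3 rows and 2 columns.
toy <- data.frame(length = c(72.4, 145.2, 190.3),
                  weight = c(3.0, 90.0, 107.5))

# In an R Markdown sentence the same expressions would appear inline, e.g.
#   The dataset has `r nrow(toy)` rows and `r ncol(toy)` columns.
dims_sentence <- sprintf("The dataset has %d rows and %d columns.",
                         nrow(toy), ncol(toy))
print(dims_sentence)
```

The advantage of the inline form is that the numbers in the text update automatically whenever the underlying data change.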
TASK_3
Performing data analysis on
variables
# Select the species, length, and weight columns, then extract the first five and the last five records.
inchBio_subset <- M3Data %>%
  select(species, length, weight) %>%
  headtail(n = 5)
kable(inchBio_subset, digits = 2, format = "html", align = "l") %>%
  row_spec(0, background = "slategrey", bold = TRUE, color = "white") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, font_size = 14)
| | species | length | weight |
|---|---|---|---|
| 1 | Black Crappie | 268.75 | 276.28 |
| 2 | Black Crappie | 298.38 | 380.55 |
| 3 | Black Crappie | 275.36 | 260.68 |
| 4 | Black Crappie | 154.17 | 47.29 |
| 5 | Black Crappie | 332.43 | 580.88 |
| 501 | Yellow Perch | 223.26 | 114.20 |
| 502 | Yellow Perch | 86.37 | 5.87 |
| 503 | Yellow Perch | 93.06 | 8.16 |
| 504 | Yellow Perch | 82.16 | 5.82 |
| 505 | Yellow Perch | 72.45 | 3.04 |
Observations:
In the above code, we used the pipe operator (%>%) from the magrittr package to focus on three specific columns: species, length, and weight. This narrowed down our analysis. We then used the headtail function (from the FSA package) to show the first five and last five records of this new subset. Finally, we presented the result as a neat, organized table using kable and kable_styling, which makes the information in the dataset easier to read and present.
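headtail comes from the FSA package. Where that package is unavailable, a similar first-and-last view can be assembled from base R's head and tail. A minimal sketch on a hypothetical 12-row data frame:

```r
# Hypothetical toy data with 12 rows.
toy <- data.frame(id = 1:12, length = seq(50, 160, by = 10))

# First three and last three rows, stacked into one data frame.
first_last <- rbind(head(toy, 3), tail(toy, 3))
print(first_last)
```

The original row numbers are kept as row names, so the gap between the two halves stays visible.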
TASK_4
Conducting an analysis to describe the
statistical characteristics of the length and weight
variables.
# Conducting an analysis to describe the statistical characteristics of the length and weight variables.
M3mean_l <- mean(M3Data$length)
M3median_l <- median(M3Data$length)
M3sd_l <- sd(M3Data$length)
M3quantiles_l <- quantile(M3Data$length, probs = c(0.25, 0.5, 0.75))
M3mean_wt <- mean(M3Data$weight)
M3median_wt <- median(M3Data$weight)
M3sd_wt <- sd(M3Data$weight)
M3quantiles_wt <- quantile(M3Data$weight, probs = c(0.25, 0.5, 0.75))
# Make vectors for the names of the columns and rows.
columns <- c("Mean",
"Median",
"Std Dev",
"25th_quant",
"50th_quant (Median)",
"75th_quant")
rows <- c("Length",
"Weight")
# Matrix to store data (byrow = TRUE so that each variable fills one row)
stats_matrix <- matrix(
  c(M3mean_l,
    M3median_l,
    M3sd_l,
    M3quantiles_l,
    M3mean_wt,
    M3median_wt,
    M3sd_wt,
    M3quantiles_wt),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(rows, columns)
)
kable(stats_matrix, digits = 2, format = "html", align = "l") %>%
  row_spec(0, background = "slategrey", bold = TRUE, color = "white") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, font_size = 14)
| | Mean | Median | Std Dev | 25th_quant | 50th_quant (Median) | 75th_quant |
|---|---|---|---|---|---|---|
| Length | 160.52 | 153.04 | 100.29 | 63.90 | 153.04 | 230.11 |
| Weight | 129.32 | 60.24 | 167.57 | 3.07 | 60.24 | 193.41 |
Observations:
The code above computes basic descriptive statistics for the “length” and “weight” variables in the “M3Data” dataset. The derived statistics, including mean, median, standard deviation, and selected quantiles, are then grouped into a matrix table for better presentation using kable and kable_styling. Because matrix fills its cells column by column by default, byrow = TRUE is required so that the six statistics of each variable end up in the same row.
The mean “Length” (160.52) is slightly larger than the median (153.04), suggesting a mildly positively skewed distribution with some longer lengths pulling the mean higher.
The “Weight” distribution is highly skewed, as indicated by the large difference between the mean and median. The median of 60.24 is less than half the mean of 129.32.
In both variables, the standard deviation (100.29 for “Length”, 167.57 for “Weight”) indicates substantial variability around the mean; for “Weight” it exceeds the mean itself.
The 25th percentile for “Length” (63.90) indicates that 25% of the observations have lengths lower than this value, while the 75th percentile (230.11) indicates that 75% have lengths lower than this value.
For “Weight,” the 25th percentile (3.07) is far below the median (60.24) and the 75th percentile (193.41) is far above it, indicating that a quarter of the fish are very light while the upper quarter are much heavier.
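When a statistics table is assembled with matrix, the fill order matters: values are placed column by column unless byrow = TRUE is set. The following sketch uses made-up numbers (`stats_a` and `stats_b` are hypothetical) to show the difference.

```r
# Six illustrative statistics for each of two hypothetical variables.
stats_a <- c(1, 2, 3, 4, 5, 6)
stats_b <- c(10, 20, 30, 40, 50, 60)

# Default fill is column by column, which interleaves the two variables.
by_col <- matrix(c(stats_a, stats_b), nrow = 2)

# byrow = TRUE keeps each variable's statistics together in one row.
by_row <- matrix(c(stats_a, stats_b), nrow = 2, byrow = TRUE,
                 dimnames = list(c("a", "b"), paste0("stat", 1:6)))

print(by_col)
print(by_row)
```

In `by_col` the first row mixes values from both variables, whereas in `by_row` each row holds exactly one variable's statistics in their original order.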
TASK_5
Displaying mean length and weight of 7
Species in matrix format
mean_data <- M3Data %>%
group_by(species) %>%
summarise(Mean_length = round(mean(length), 2), Mean_Wt = round(mean(weight), 2))
matrix_table <- mean_data %>%
kable(format = "html", caption = "Species average length and weight", align = "l") %>%
row_spec(0, background = "slategrey", bold = TRUE, color = "white")%>%
kable_styling(full_width = TRUE, "striped",font_size = 14)
matrix_table
| species | Mean_length | Mean_Wt |
|---|---|---|
| Black Crappie | 276.08 | 360.35 |
| Bluegill | 145.62 | 90.03 |
| Bluntnose Minnow | 64.19 | 3.03 |
| Iowa Darter | 49.43 | 1.88 |
| Largemouth Bass | 299.28 | 353.69 |
| Pumpkinseed | 135.08 | 99.44 |
| Yellow Perch | 190.29 | 107.49 |
Observations:
The above code calculates the mean length and mean weight for each of
the seven species in the inchBio dataset. The calculated
means are rounded to two decimal places. The results are then presented
in a table. Here are some observations about the code and the
obtained data:
The table provides a clear overview of the mean length and mean weight for each of the seven species. This presentation makes it easy to compare and contrast the characteristics of different species based on these two variables.
The data reveals substantial variation in mean length and mean weight
among the species. For example, Largemouth Bass has the
highest mean length (299.28) and Black Crappie has the
highest mean weight (360.35), while Iowa Darter has the lowest
mean length (49.43) and mean weight (1.88).
The table highlights the considerable size differences among the
species. Bluntnose Minnow and Iowa Darter have by far the
smallest mean lengths and weights, Pumpkinseed, Bluegill, and
Yellow Perch are intermediate, and
Black Crappie and Largemouth Bass are the largest
on average.
These mean values provide insights into the ecological characteristics
of each species. For instance, the relatively small size of
Iowa Darter and Bluntnose Minnow could suggest
adaptations to specific habitats and ecological niches.
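Statements about which species is largest or smallest can be checked programmatically instead of by eye. The sketch below re-enters the species means from the table above into a data frame and uses base R's which.max, which.min, and order; the object names are illustrative.

```r
# Species means copied from the table above.
mean_data <- data.frame(
  species = c("Black Crappie", "Bluegill", "Bluntnose Minnow", "Iowa Darter",
              "Largemouth Bass", "Pumpkinseed", "Yellow Perch"),
  Mean_length = c(276.08, 145.62, 64.19, 49.43, 299.28, 135.08, 190.29),
  Mean_Wt     = c(360.35, 90.03, 3.03, 1.88, 353.69, 99.44, 107.49)
)

# Species with the largest and smallest means.
longest  <- mean_data$species[which.max(mean_data$Mean_length)]
heaviest <- mean_data$species[which.max(mean_data$Mean_Wt)]
smallest <- mean_data$species[which.min(mean_data$Mean_length)]

# Rank all species from longest to shortest mean length.
ranked <- mean_data[order(-mean_data$Mean_length), ]
print(ranked)
```

With the grouped summary from the report, the same calls could be applied directly to the `mean_data` object produced by summarise.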
TASK_6
The objective is to generate a table that
displays the occurrences, accumulated occurrences, likelihood, and
accumulated likelihood of the species variable.
species_freq_table <- table(M3Data$species)
freq_df <- data.frame(Species = names(species_freq_table),
Frequency = as.vector(species_freq_table))
# Compute the accumulated frequencies, likelihoods, and cumulative likelihoods.
total_species <- sum(freq_df$Frequency)
freq_df <- freq_df %>%
mutate(Cumulative_Frequency = cumsum(Frequency),
Probability = Frequency / total_species,
Cumulative_Probability = cumsum(Probability))
kable(freq_df, format = "html", digits = 2, align = "l") %>%
row_spec(0, background = "slategrey", bold = TRUE, color = "white")%>%
kable_styling(full_width = TRUE, "striped",font_size = 14)
| Species | Frequency | Cumulative_Frequency | Probability | Cumulative_Probability |
|---|---|---|---|---|
| Black Crappie | 25 | 25 | 0.05 | 0.05 |
| Bluegill | 208 | 233 | 0.41 | 0.46 |
| Bluntnose Minnow | 100 | 333 | 0.20 | 0.66 |
| Iowa Darter | 31 | 364 | 0.06 | 0.72 |
| Largemouth Bass | 90 | 454 | 0.18 | 0.90 |
| Pumpkinseed | 13 | 467 | 0.03 | 0.92 |
| Yellow Perch | 38 | 505 | 0.08 | 1.00 |
In the above code, we created a frequency table for the variable
species in the M3Data (inchBio) dataset,
converted this frequency table into a data frame, and calculated
cumulative frequencies, probabilities, and
cumulative probabilities. The results are presented in a tabular format.
Here are observations about the code and the obtained data:
The frequency table presents the number of occurrences for each species in the dataset. This information provides a clear count of how many observations are associated with each species.
The cumulative frequency represents the total of frequencies as we move
through the species in the table. For instance, the cumulative frequency
for Bluegill is 233, which is the sum of its
frequency and the frequencies of the previous species.
The probability column indicates the likelihood of encountering each species, calculated by dividing its frequency by the total number of observations (505). Cumulative probability is the running total of these probabilities as we move down the table.
The cumulative probability for “Bluntnose Minnow” (0.66) indicates that
66% of the fish belong to the first three species in the table
(Black Crappie, Bluegill, and Bluntnose Minnow). The
cumulative probability for Yellow Perch is 1.00 because it is the
last species in the table, where the running total covers all observations.
The species are listed alphabetically, so the cumulative values depend on that ordering.
The frequencies vary widely among species, with Bluegill
being the most frequently observed species (208 occurrences), while
Pumpkinseed (13) and Black Crappie (25) are the least commonly
observed.
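Because cumulative values depend on row order, one option is to sort the species by frequency before accumulating, so the running total shows how quickly the most common species account for the sample. The sketch below re-enters the counts from the frequency table above; the sorting step is a suggested variation, not part of the original analysis.

```r
# Species counts copied from the frequency table above.
freq <- c("Black Crappie" = 25, "Bluegill" = 208, "Bluntnose Minnow" = 100,
          "Iowa Darter" = 31, "Largemouth Bass" = 90, "Pumpkinseed" = 13,
          "Yellow Perch" = 38)

# Sort from most to least frequent before accumulating.
freq_sorted     <- sort(freq, decreasing = TRUE)
prob_sorted     <- freq_sorted / sum(freq_sorted)
cum_prob_sorted <- cumsum(prob_sorted)

print(round(cbind(Frequency = freq_sorted,
                  Probability = prob_sorted,
                  Cumulative_Probability = cum_prob_sorted), 2))
```

In this ordering, the three most common species (Bluegill, Bluntnose Minnow, and Largemouth Bass) together make up 398 of the 505 observations.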
TASK_7
Producing a pie chart to depict
probabilities and a bar plot to illustrate cumulative
probabilities.
# Producing a pie chart to depict probabilities
pie_chart <- plot_ly(
freq_df,
labels = ~Species,
values = ~Probability,
type = "pie",
textinfo = "label+percent",
title = list(
text = "Probability of Each Species",
font = list(size = 20, weight = "bold"))
)
subplot(pie_chart)
# Define custom colors for each species
custom_colors <- c("#43a2ca", "slategrey", "red", "green", "lightblue", "orange", "yellow")
# Producing a bar plot to illustrate cumulative probabilities
bar_plot <- barplot(
freq_df$Cumulative_Probability,
names.arg = freq_df$Species,
main = "Accumulated Likelihood for Each Species",
xlab = "Species",
ylab = "Cumulative Probability",
cex.names = 0.5,
col = custom_colors)
# Display the values on top of the bars
text(bar_plot, freq_df$Cumulative_Probability, labels = sprintf("%.2f", freq_df$Cumulative_Probability), pos = 3)
Observations:
Observations from the Pie Chart:
The pie chart provides a visual representation of the probability distribution of each species. Each slice of the pie corresponds to a species, and its size reflects the proportion of the total probability it represents.
The species Bluegill has the largest slice, indicating that
it has the highest probability among the species.
Bluntnose Minnow and Largemouth Bass have
similar probabilities, while the other species have lower probabilities.
The legend displays the labels of the species along with their corresponding percentages.
Observations from the Bar Plot:
The bar plot illustrates the cumulative probabilities of each species. The x-axis represents the species, while the y-axis represents the cumulative probability. The bars are colored differently for each species, aiding easy visual differentiation.
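Yellow Perch has the highest cumulative probability (1.00) because
it is the last species in the table; cumulative values can only increase
from left to right. The largest jump occurs at Bluegill, which adds 0.41 to the running total.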
The values on top of each bar show the cumulative probability up to two decimal places, aiding precise interpretation.
Both visualizations effectively convey insights about the distribution of probabilities and cumulative probabilities among the different species in the dataset. The pie chart provides a clear overview of the probability distribution, while the bar plot highlights the cumulative probabilities and enables comparison between species.
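A bar chart of the individual probabilities, ordered from largest to smallest, is a common alternative to a pie chart for comparing categories. The following is a minimal base R sketch that re-enters the counts from Task 6; the plot title, axis limits, and null graphics device are illustrative choices rather than part of the original report.

```r
# Species counts copied from the frequency table in Task 6.
freq <- c("Black Crappie" = 25, "Bluegill" = 208, "Bluntnose Minnow" = 100,
          "Iowa Darter" = 31, "Largemouth Bass" = 90, "Pumpkinseed" = 13,
          "Yellow Perch" = 38)
prob_sorted <- sort(freq / sum(freq), decreasing = TRUE)

# Draw to a null device so the sketch runs without opening a window
# or writing a file; omit pdf(NULL) and dev.off() in an interactive session.
pdf(NULL)
mids <- barplot(prob_sorted,
                main = "Probability of Each Species",
                ylab = "Probability",
                ylim = c(0, 0.5),
                las = 2,
                cex.names = 0.6)
# Label each bar with its probability, rounded to two decimals.
text(mids, prob_sorted, labels = sprintf("%.2f", prob_sorted), pos = 3)
invisible(dev.off())
```

barplot returns the bar midpoints, which is why `mids` can be passed to text to position the labels.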
CONCLUSION
In this report, we embarked on a comprehensive exploration of a dataset containing information about various fish species. By leveraging a range of libraries and tools in R, we delved into different aspects of data processing, analysis, and visualization to uncover insights into the characteristics and distribution of these species.
The initial steps of our analysis involved gaining an understanding of
the dataset’s structure and dimensions. We utilized libraries such as
tidyverse, readxl, and kableExtra
to enhance data presentation. Through the summary function,
we obtained key statistical measures for the numeric variables, shedding
light on central tendencies and variabilities. Additionally, the
glimpse function enabled us to obtain a detailed overview
of the dataset’s structure, including data types and initial
observations.
Subsequently, we directed our attention to specific columns, namely
species, length, and weight, and selectively extracted and presented the
first and last five records using the headtail function.
This approach allowed us to focus on a subset of the data for closer
examination.
In the pursuit of deeper insights, we calculated descriptive statistics for the length and weight variables. The resulting table provided mean, median, standard deviation, and quantiles, enabling a comprehensive grasp of the distribution and variability within these attributes.
Species-based analysis unveiled intriguing patterns. By calculating mean lengths and weights for each species, we were able to highlight distinct size variations among the different fish. This information provides a glimpse into the ecological characteristics and adaptations of each species.
A significant portion of our analysis centered around the construction of a frequency table for the species variable. This table, along with cumulative frequencies, probabilities, and cumulative probabilities, facilitated a holistic understanding of species occurrences and their likelihood. Our visualizations, including a pie chart displaying probabilities and a bar plot illustrating cumulative probabilities, vividly showcased the distribution and relative significance of each species.
In summation, this report showcases the power of R as a versatile tool
for data analysis and visualization. Through libraries such as
dplyr, plotly, and others, we navigated
through data exploration, statistical analysis, and graphical
representation. The insights gained from this analysis contribute to our
understanding of fish species characteristics, their distributions, and
the ecosystem they inhabit. This report serves as a testament to the
utility of R in uncovering valuable insights from datasets, furthering
our knowledge in ecological and biological domains.
BIBLIOGRAPHY
Appendix
This report contains an R Markdown file named as follows
Ansari_ALY6000Project_M3.Rmd