Speciesfisheries_Statistical-Analysis.knit

ALY6000 Introduction to Analytics
Instructor Name: Prof. Dee Chiluiza, PhD
Northeastern University
Student: Jayakumar Moris Udayakumar
Date: 17 November, 2024

Executive Summary Report

INTRODUCTION:
In this executive summary report, the dataset “speciesfisheries” is analyzed. The dataset contains the length and weight of fish from 7 different species, with 505 observations in total. The report presents basic descriptive statistics, the average length and weight per species, and the frequencies, probabilities, and cumulative probabilities of the variable species. Graphically, a pie chart displays the probabilities and a bar plot displays the cumulative probabilities. The analysis is guided by the textbook by Bluman, A. (2018) and the RPubs blog by Prof. Chiluiza, D. (2023).

ANALYSIS SECTION

Task 1:

Task 1A:

Description: Using the summary() function to obtain information about the whole “speciesfisheries” dataset and presenting the result with kable().

netID	fishID	species	length	weight	scale
Min. : 4.00	Min. : 7.0	Length:505	Min. : 30.47	Min. : 0.9027	Mode :logical
1st Qu.: 12.00	1st Qu.:169.0	Class :character	1st Qu.: 63.90	1st Qu.: 3.0711	FALSE:194
Median :101.00	Median :569.0	Mode :character	Median :153.04	Median : 60.2374	TRUE :311
Mean : 78.68	Mean :487.5	NA	Mean :160.52	Mean : 129.3169	NA
3rd Qu.:113.00	3rd Qu.:762.0	NA	3rd Qu.:230.11	3rd Qu.: 193.4106	NA
Max. :206.00	Max. :915.0	NA	Max. :432.58	Max. :1071.8813	NA

Observation: The summary shows descriptive statistics for each variable in the dataset. For the character variable species, it shows only the length, class, and mode; for the logical variable “scale”, it counts 311 TRUE and 194 FALSE values. For the variable “length”, the minimum is 30.47, the maximum is 432.58, the mean is 160.52, and the median is 153.04. Similarly, for “weight” the minimum is 0.90, the maximum is 1071.88, the median is 60.24, and the mean is 129.32. These basic observations are directly visible from the summary function, which is quick and informative.
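The code for this task is not shown in the knitted output. As a minimal sketch of how such a summary table could be produced, the block below uses a small stand-in data frame (the name toy is hypothetical, and its three rows are copied from the glimpse output in Task 1B); it is not the author's actual chunk.

```r
# Hypothetical stand-in for the real dataset: three rows taken from the
# glimpse output in this report. "toy" is an assumed name.
toy <- data.frame(
  species = c("Black Crappie", "Black Crappie", "Black Crappie"),
  length  = c(268.75365, 298.37579, 275.35851),
  weight  = c(276.278409, 380.552210, 260.684962),
  scale   = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# summary() on a data frame returns a table with one column per variable
toy_summary <- summary(toy)
print(toy_summary)

# In a report, the same object could be handed to knitr::kable(), e.g.
# knitr::kable(toy_summary)   # assumes the knitr package is installed
```

Numeric columns get the six-number summary, while character and logical columns get only a short description, which matches the pattern seen in the table above.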

Task 1B

Description: Applying glimpse function to get information

#applying glimpse function
glimpse(speciesfisheries)

## Rows: 505
## Columns: 6
## $ netID   <dbl> 5, 16, 16, 21, 24, 24, 101, 101, 101, 101, 102, 102, 102, 102,…
## $ fishID  <dbl> 137, 208, 209, 218, 268, 269, 532, 534, 535, 537, 626, 627, 62…
## $ species <chr> "Black Crappie", "Black Crappie", "Black Crappie", "Black Crap…
## $ length  <dbl> 268.75365, 298.37579, 275.35851, 154.17039, 332.43289, 309.698…
## $ weight  <dbl> 276.278409, 380.552210, 260.684962, 47.293176, 580.883012, 441…
## $ scale   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…

Observation: The glimpse output shows the structure of the dataset. Compared with the usual view, the layout is transposed: the column names are written down the page, each followed by its data type and its first few values. For instance, the variable ‘netID’ starts with the values 5, 16, 16, and so on. The output also reports 505 rows (total observations) and 6 columns.

Task 1C

Comparison of the Task 1A and 1B results:

Comparing Tasks 1A and 1B amounts to comparing descriptive statistics of the dataset with a glimpse of the raw data. Both outputs agree on the number of observations (505) and variables (6). Because glimpse() shows only the data type and the first few values of each variable, it cannot be used to check the statistics reported by summary(). However, it does reveal the structure of the dataset.

Task 2
Description: Using inline R code to present the number of columns and rows in the dataset

#To calculate the number of rows and columns, creating two objects
speciesfisheriescol <- ncol(speciesfisheries)
speciesfisheriesrow <- nrow(speciesfisheries)

How many variables (columns) does the data set contain?
Columns = 6
How many observations (rows) does the data set contain?
Rows = 505

Observation: Inline R code is helpful for showing computed values directly in the report text, and the values update automatically if the data change.
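The inline syntax itself does not appear in the knitted output, so the following is only a sketch of how it could look. The data frame toy and the object names toy_cols and toy_rows are hypothetical; the inline expressions are shown as comments because they belong in the R Markdown text rather than in a code chunk.

```r
# Stand-in data frame; "toy" is a hypothetical name, not the real dataset
toy <- data.frame(length = c(268.75, 298.38, 275.36),
                  weight = c(276.28, 380.55, 260.68))

toy_cols <- ncol(toy)
toy_rows <- nrow(toy)

# In the .Rmd source, the report text would reference the objects inline:
#   Columns = `r toy_cols`
#   Rows = `r toy_rows`
# When the file is knit, each inline expression is replaced by its value.
# paste() is used here only to imitate the rendered text in plain R.
rendered <- paste("Columns =", toy_cols, "| Rows =", toy_rows)
print(rendered)
```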

Task 3
Description: Selecting a few variables (columns) to perform data analysis.

#isolating variables species, length, weight and then using headtail code to present first 5 and last 5 records of the selected variables

speciesfisheries %>%
  select(species, length, weight)%>%
  headtail(n=5)%>%
  kable(align = "c", digits = 2)%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "grey", table.envir = "table", protect_latex = TRUE)

	species	length	weight
1	Black Crappie	268.75	276.28
2	Black Crappie	298.38	380.55
3	Black Crappie	275.36	260.68
4	Black Crappie	154.17	47.29
5	Black Crappie	332.43	580.88
501	Yellow Perch	223.26	114.20
502	Yellow Perch	86.37	5.87
503	Yellow Perch	93.06	8.16
504	Yellow Perch	82.16	5.82
505	Yellow Perch	72.45	3.04

Observation: The table shows the variables species, length, and weight selected from the dataset “speciesfisheries”, with the first 5 and last 5 records presented using headtail(). The output matches the raw data. In these ten records, Black Crappie are on average longer and heavier than Yellow Perch. However, Yellow Perch can also grow fairly large, as one observation shows a length of 223.26 and a weight of 114.20.
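For readers without the package that provides headtail(), which is not part of base R, a comparable first-and-last view can be sketched with head() and tail(). The data frame toy below is a hypothetical stand-in with made-up values.

```r
# Hypothetical stand-in: 12 rows so that the head and tail do not overlap
toy <- data.frame(id = 1:12, length = seq(50, 160, by = 10))

# Base-R version of a head/tail view: first 5 and last 5 rows
first_last <- rbind(head(toy, 5), tail(toy, 5))
print(first_last)
```

The original row numbers are kept as row names, so the gap between the two halves stays visible, as it does in the table above.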

Task 4

Description: Evaluating descriptive statistics for the variables “length” and “weight”

# descriptive statistics of variables length and weight

meanlength <- mean(speciesfisheries$length)
meanweight <- mean(speciesfisheries$weight)
medianlength <- median(speciesfisheries$length)
medianweight <- median(speciesfisheries$weight)
sdlength <- sd(speciesfisheries$length)
sdweight <- sd(speciesfisheries$weight)

# column names, row names, and the statistics filled row by row (one row per statistic)

col_names = c("length", "weight")
row_names = c("Mean", "Median", "SD")
speciesfisheriestable = matrix(c(meanlength, meanweight, medianlength, medianweight, sdlength, sdweight), nrow = 3, byrow = TRUE, dimnames = list(row_names, col_names))

# using kable() to present the table
kable(speciesfisheriestable, align = "c", digits = 2, format = "html")%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "black", table.envir = "table", protect_latex = TRUE)

	length	weight
Mean	160.52	129.32
Median	153.04	60.24
SD	100.29	167.57

Observation: The table presents the mean, median, and standard deviation (SD) of the variables “length” and “weight”. For instance, “length” has a mean of 160.52 and a median of 153.04, so the mean exceeds the median by about 7.48.
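The same kind of table can be built more compactly by applying each statistic to every column at once. This is a sketch of an alternative, not the method used in the report; the data frame toy and its values are hypothetical.

```r
# Hypothetical stand-in values; the real dataset has 505 rows
toy <- data.frame(length = c(30, 60, 150, 230, 430),
                  weight = c(1, 3, 60, 190, 1070))

# One row per statistic, one column per variable
stats <- rbind(
  Mean   = sapply(toy, mean),
  Median = sapply(toy, median),
  SD     = sapply(toy, sd)
)
print(round(stats, 2))

# A mean above the median is a common sign of a right-skewed variable
print(stats["Mean", ] > stats["Median", ])
```

Because sapply() keeps the column names, the matrix does not need separate dimnames, and adding a variable to the data frame adds a column to the table automatically.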

Task 5
Description: In order to select individual categories of the categorical variable, applying the filter function

# filtering individual categories of the categorical variable "Species" and finding mean for the variable length and weight, respectively

blackc = speciesfisheries %>%
  filter(species=="Black Crappie")
    
meanlengthbc = mean(blackc$length)
meanweightbc = mean(blackc$weight)

Blueg = speciesfisheries %>%
  filter(species=="Bluegill")

meanlengthbg = mean(Blueg$length)
meanweightbg = mean(Blueg$weight)  

Bluntnosem = speciesfisheries %>%
  filter(species=="Bluntnose Minnow")

meanlengthbm = mean(Bluntnosem$length)  
meanweightbm = mean(Bluntnosem$weight)

Iowa = speciesfisheries %>%
  filter(species=="Iowa Darter")

meanlengthiw = mean(Iowa$length)
meanweightiw = mean(Iowa$weight)

Largemouth = speciesfisheries %>%
  filter(species=="Largemouth Bass")

meanlengthlb = mean(Largemouth$length)
meanweightlb = mean(Largemouth$weight)

Pumpkin = speciesfisheries %>%
  filter(species=="Pumpkinseed")

meanlengthps = mean(Pumpkin$length)
meanweightps = mean(Pumpkin$weight)

Yellowperch = speciesfisheries %>%
  filter(species=="Yellow Perch")

meanlengthyp = mean(Yellowperch$length)  
meanweightyp = mean(Yellowperch$weight)

# column names, row names, and the per-species means filled row by row (one row per species)

colnamesspecies = c("Avg length", "Avg weight")
rownamesspecies = c("Black Crappie", "Bluegill", "Bluntnose Minnow", "Iowa Darter", "Largemouth Bass", "Pumpkinseed", "Yellow Perch")
sevenspeciestable = matrix(c(meanlengthbc, meanweightbc, meanlengthbg, meanweightbg, meanlengthbm, meanweightbm, meanlengthiw, meanweightiw, meanlengthlb, meanweightlb, meanlengthps, meanweightps, meanlengthyp, meanweightyp), nrow = 7, byrow = TRUE, dimnames = list(rownamesspecies, colnamesspecies))

# using kable () to present the table
kable(sevenspeciestable, digits = 2, align = "c")%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "brown", table.envir = "table", protect_latex = TRUE)

	Avg length	Avg weight
Black Crappie	276.08	360.35
Bluegill	145.62	90.03
Bluntnose Minnow	64.19	3.03
Iowa Darter	49.43	1.88
Largemouth Bass	299.28	353.69
Pumpkinseed	135.08	99.44
Yellow Perch	190.29	107.49

Observation: The table shows the average length and weight for each category of the categorical variable “species”. “Largemouth Bass” has the highest average length (299.28), whereas “Iowa Darter” has the lowest (49.43). Black Crappie is the heaviest species on average (360.35), and “Iowa Darter” is the lightest (1.88).
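Filtering each species separately works but repeats the same steps seven times. As a hedged alternative, grouped means can be computed in one call with base R's aggregate(). The data frame toy and its values below are hypothetical and only illustrate the shape of the result.

```r
# Hypothetical stand-in with two species and made-up measurements
toy <- data.frame(
  species = c("Bluegill", "Bluegill", "Iowa Darter", "Iowa Darter"),
  length  = c(140, 150, 48, 50),
  weight  = c(85, 95, 1.8, 2.0)
)

# Mean of length and weight within each species, in a single call
species_means <- aggregate(cbind(length, weight) ~ species,
                           data = toy, FUN = mean)
print(species_means)
```

The result has one row per species, so a new species in the data would appear without any change to the code.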

Task 6
Description: Creating a table to present the frequencies, cumulative frequencies, probability, and cumulative probability of variable species.

Reference: Bluman, A. (2018)

#creating a frequency table of the variable species
speciestable <- table(speciesfisheries$species)

#transposing the table and converting it to a data frame (columns Var1, Var2, Freq)
speciestabletp <- data.frame(t(speciestable))

#renaming the columns
speciestabletp <- speciestabletp %>% rename(Frequency = Freq)
speciestabletp <- speciestabletp %>% rename(speciesName = Var2)

#using mutate function to create new variables for cumulative frequencies, probabilities, and cumulative probability of variable species

speciestabletp <- speciestabletp %>%
  mutate(
    Cumulativefreq = cumsum(Frequency),
    Probability = Frequency / sum(Frequency),
    CumulativeProbability = cumsum(Probability)
  )

#dropping the helper column Var1, which holds the constant value 'A' after transposing
speciestabletp <- subset(speciestabletp, speciestabletp$Var1 == 'A', select = -c(Var1))

# using kable() to present the table

kable(speciestabletp, align = "c", digits = 2)%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "green", table.envir = "table", protect_latex = TRUE)

speciesName	Frequency	Cumulativefreq	Probability	CumulativeProbability
Black Crappie	25	25	0.05	0.05
Bluegill	208	233	0.41	0.46
Bluntnose Minnow	100	333	0.20	0.66
Iowa Darter	31	364	0.06	0.72
Largemouth Bass	90	454	0.18	0.90
Pumpkinseed	13	467	0.03	0.92
Yellow Perch	38	505	0.08	1.00

Observation: Task 6 gives the frequencies, cumulative frequencies, probabilities, and cumulative probabilities for each category of the variable species. This allows questions to be answered directly: for example, the probability that a randomly selected fish is a Pumpkinseed is 0.03 (the lowest), whereas for Bluegill it is 0.41 (the highest). Correspondingly, Pumpkinseed has the lowest frequency (13) and Bluegill the highest (208).
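The arithmetic behind the table can be checked by hand from the frequencies alone. The sketch below copies the seven frequencies from the table above into a named vector (the object names are hypothetical) and recomputes the other three columns.

```r
# Frequencies copied from the table above
freq <- c("Black Crappie" = 25, "Bluegill" = 208, "Bluntnose Minnow" = 100,
          "Iowa Darter" = 31, "Largemouth Bass" = 90, "Pumpkinseed" = 13,
          "Yellow Perch" = 38)

cum_freq <- cumsum(freq)        # running total of the counts
prob     <- freq / sum(freq)    # each count divided by the 505 observations
cum_prob <- cumsum(prob)        # running total of the probabilities

freq_table <- data.frame(Frequency = freq,
                         Cumulativefreq = cum_freq,
                         Probability = round(prob, 2),
                         CumulativeProbability = round(cum_prob, 2))
print(freq_table)
```

For example, Bluegill's probability is 208 / 505, which rounds to 0.41, and the last cumulative frequency equals the total number of observations.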

Task 7
Description: Presenting pie chart to display probability, and bar plot to display cumulative probability

# note: par(mfrow = c(1, 2)) only affects base R graphics, not ggplot2, so the two plots are drawn one after the other

# Creating pie chart of probability data
ggplot(speciestabletp, aes(x = "", y = Probability, fill = speciesName)) +
  geom_bar(stat = "identity", width = 0.5, color = "white") +
  coord_polar("y", start=0) +
  labs(fill = "Species Name") +
  ggtitle("Pie Chart - Probability of Species and its categories") +
  # theme_minimal() comes first so that it does not reset the legend position
  theme_minimal() +
  theme(legend.position = "bottom")

# Create bar plot of cumulative probability data
ggplot(speciestabletp, aes(x = speciesName, y = CumulativeProbability)) +
  geom_bar(stat = "identity", fill = "orange") +
  ggtitle("Bar Plot - Cumulative Probability of categorical variable Species") +
  xlab("Species Name") +
  ylab("Cumulative Probability") +
  theme_minimal()

Observation:
From the above graphical presentation, we can see that the probabilities and cumulative probabilities differ across the categories of the variable species. For instance, “Yellow Perch” has the highest cumulative probability in the bar plot (1.00), whereas its probability in the pie chart (0.08) is not the highest. This is because Yellow Perch is the last category in alphabetical order, so its cumulative probability includes all categories before it.
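The cumulative probability of the last category is always 1, so which species has the tallest bar depends only on the order of the categories. The sketch below illustrates this with the frequencies from Task 6; the object names are hypothetical, and sorting by frequency is an illustrative choice rather than part of the original analysis.

```r
# Frequencies copied from the Task 6 table
freq <- c("Black Crappie" = 25, "Bluegill" = 208, "Bluntnose Minnow" = 100,
          "Iowa Darter" = 31, "Largemouth Bass" = 90, "Pumpkinseed" = 13,
          "Yellow Perch" = 38)
prob <- freq / sum(freq)

# Alphabetical order, as in the report
cum_alpha <- cumsum(prob)

# Most frequent species first
cum_sorted <- cumsum(sort(prob, decreasing = TRUE))

print(round(cum_alpha, 2))
print(round(cum_sorted, 2))

# The category that reaches 1 is simply whichever comes last
last_alpha  <- names(cum_alpha)[length(cum_alpha)]
last_sorted <- names(cum_sorted)[length(cum_sorted)]
print(c(last_alpha, last_sorted))
```

In alphabetical order the last category is Yellow Perch, while in frequency order it is Pumpkinseed, even though the underlying probabilities are unchanged.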

Conclusion:
In this executive summary report, we analyzed descriptive statistics for the dataset and for the individual categories of the categorical variable species, and presented them as tables and graphs. The frequency table makes it easy to read off the frequency, probability, and cumulative probability of each category.

Learnings:
I have learned how to use inline R code, how to compute probabilities and cumulative probabilities for a categorical variable, and how to use ggplot to present graphs.

References:
1. Chiluiza, D. (2023). RPubs blog. URL: https://rpubs.com/Dee_Chiluiza
2. Bluman, A. (2018). Elementary Statistics: A Step by Step Approach. In Bluman, A., Frequency distribution and Graphs (pp. 47-51).

Appendix:
An R Markdown file has been attached to this report. The name of the file is “M3Project_Rmarkdown.rmd”