ALY6000 Introduction to Analytics
Instructor Name: Prof. Dee Chiluiza, PhD
Northeastern University
Student: Jayakumar Moris Udayakumar
Date: 17 November, 2024

Executive Summary Report


INTRODUCTION:
In this executive summary report, the dataset “speciesfisheries” is analyzed. The dataset contains information about 7 fish species and their respective lengths and weights, for a total of 505 observations. The report presents basic descriptive statistics, the average length and weight of each species, and the frequencies, probabilities, and cumulative probabilities of the variable species. In the graphical presentation, a pie chart displays the probabilities and a bar plot displays the cumulative probabilities. To support and guide the analysis, references were taken from the book by Bluman, A. (2018) and the RPubs blog by Prof. Chiluiza, D. (2023).


ANALYSIS SECTION


Task 1:

Task 1A:

Description: Using the summary() function to obtain information about the whole “speciesfisheries” dataset and presenting the result using kable().

netID           fishID         species           length          weight             scale
Min.   :  4.00  Min.   :  7.0  Length:505        Min.   : 30.47  Min.   :   0.9027  Mode :logical
1st Qu.: 12.00  1st Qu.:169.0  Class :character  1st Qu.: 63.90  1st Qu.:   3.0711  FALSE:194
Median :101.00  Median :569.0  Mode  :character  Median :153.04  Median :  60.2374  TRUE :311
Mean   : 78.68  Mean   :487.5  NA                Mean   :160.52  Mean   : 129.3169  NA
3rd Qu.:113.00  3rd Qu.:762.0  NA                3rd Qu.:230.11  3rd Qu.: 193.4106  NA
Max.   :206.00  Max.   :915.0  NA                Max.   :432.58  Max.   :1071.8813  NA


Observation: The summary shows descriptive statistics for each variable in the dataset. For the character variable “species”, it reports only the length, class, and mode; for the logical variable “scale”, it counts 311 “TRUE” and 194 “FALSE” values. For the variable “length”, the minimum is 30.47, the maximum is 432.58, the mean is 160.52, and the median is 153.04. Similarly, for “weight”, the minimum is 0.90, the maximum is 1071.88, the median is 60.24, and the mean is 129.32. The summary() function provides these basic observations quickly and informatively.
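
To illustrate how summary() treats numeric, character, and logical columns differently, here is a minimal sketch in base R. The data frame fish_sample is a hypothetical stand-in built from only the first four records visible in the glimpse() output of Task 1B, so its statistics will not match the full-dataset table above.

```r
# Hypothetical mini dataset: the first four records shown in the
# glimpse() output of Task 1B (not the full 505-row dataset)
fish_sample <- data.frame(
  netID   = c(5, 16, 16, 21),
  fishID  = c(137, 208, 209, 218),
  species = rep("Black Crappie", 4),
  length  = c(268.75365, 298.37579, 275.35851, 154.17039),
  weight  = c(276.278409, 380.552210, 260.684962, 47.293176),
  scale   = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Numeric columns get the six-number summary, the character column
# gets Length/Class/Mode, and the logical column gets value counts
fish_summary <- summary(fish_sample)
print(fish_summary)
```

The same object can be passed to kable() for presentation, as described in this task.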



Task 1B

Description: Applying the glimpse() function to get information about the structure of the dataset.

#applying glimpse function
glimpse(speciesfisheries)
## Rows: 505
## Columns: 6
## $ netID   <dbl> 5, 16, 16, 21, 24, 24, 101, 101, 101, 101, 102, 102, 102, 102,…
## $ fishID  <dbl> 137, 208, 209, 218, 268, 269, 532, 534, 535, 537, 626, 627, 62…
## $ species <chr> "Black Crappie", "Black Crappie", "Black Crappie", "Black Crap…
## $ length  <dbl> 268.75365, 298.37579, 275.35851, 154.17039, 332.43289, 309.698…
## $ weight  <dbl> 276.278409, 380.552210, 260.684962, 47.293176, 580.883012, 441…
## $ scale   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…


Observation: From the glimpse output, we are able to recognize the structure of the dataset. The output is transposed relative to the actual dataset: the column names are written down the page, each followed by its data type and its first few values. For instance, the variable ‘netID’ starts with the values 5, 16, 16, and so on. The output also reports 505 rows (total observations) and 6 columns.



Task 1C


Comparison of the Task 1A and 1B results:

Comparing Tasks 1A and 1B amounts to comparing descriptive statistics of the dataset with a glimpse of the raw data. The two outputs agree with each other: the “Length:505” entry in the summary matches the 505 rows reported by glimpse(), and both show the same 6 columns. Because glimpse() shows only the data type and the first few values of each variable, it is hard to compare statistical results across the two outputs. However, glimpse() lets us recognize the data types and raw values of the dataset.



Task 2
Description: Using inline R code to present the number of columns and rows in the dataset

#To calculate the number of rows and columns, creating two objects
speciesfisheriescol <- ncol(speciesfisheries)
speciesfisheriesrow <- nrow(speciesfisheries)

How many variables (columns) does the data set contain?
Columns = 6
How many observations (rows) does the data set contain?
Rows = 505


Observation: From this task, I observed that inline R code inserts the actual computed values into the report text, so the text reflects any changes in the data automatically.
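
As a rough sketch of how such inline values are produced, the example below uses a hypothetical placeholder data frame, speciesfisheries_demo, that only mimics the dimensions of the real dataset (505 rows, 6 columns). The object names n_cols and n_rows are illustrative and are not the names used in the report.

```r
# Hypothetical stand-in for the dataset: only its dimensions matter here
speciesfisheries_demo <- as.data.frame(matrix(0, nrow = 505, ncol = 6))

n_cols <- ncol(speciesfisheries_demo)
n_rows <- nrow(speciesfisheries_demo)

# In the R Markdown text, the values would be embedded as inline code:
#   Columns = `r n_cols`
#   Rows = `r n_rows`
# When knitted, that text reads the same as this string:
report_line <- sprintf("Columns = %d, Rows = %d", n_cols, n_rows)
print(report_line)
```

Because the numbers are computed at knit time, replacing the dataset changes the sentence without any manual editing.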



Task 3
Description: Selecting a few variables (columns) to perform data analysis.

# isolating the variables species, length, and weight, then using headtail() to present the first 5 and last 5 records of the selected variables

speciesfisheries %>%
  select(species, length, weight)%>%
  headtail(n=5)%>%
  kable(align = "c", digits = 2)%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "grey", table.envir = "table", protect_latex = TRUE)
species length weight
1 Black Crappie 268.75 276.28
2 Black Crappie 298.38 380.55
3 Black Crappie 275.36 260.68
4 Black Crappie 154.17 47.29
5 Black Crappie 332.43 580.88
501 Yellow Perch 223.26 114.20
502 Yellow Perch 86.37 5.87
503 Yellow Perch 93.06 8.16
504 Yellow Perch 82.16 5.82
505 Yellow Perch 72.45 3.04

Observation: The table above shows the variables species, length, and weight selected from the dataset “speciesfisheries”, with the first 5 and last 5 records presented using headtail(). The output matches the raw data. In the records shown, the Black Crappie are generally longer and heavier than the Yellow Perch. However, Yellow Perch can also grow long and heavy, as one observation shows a length of 223.26 and a weight of 114.20.
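
For readers without the package that provides headtail(), a similar first-and-last view can be approximated in base R. This is a sketch under assumptions: the data frame demo is hypothetical, and combining head() and tail() with rbind() is a swapped-in technique, not the function used in the report.

```r
# Hypothetical data frame with 20 rows, used only for illustration
demo <- data.frame(id = 1:20, value = (1:20)^2)

# Stack the first 5 and last 5 rows; original row numbers are kept
first_last <- rbind(head(demo, 5), tail(demo, 5))
print(first_last)
```

The row names in the result (1 to 5 and 16 to 20) play the same role as the record numbers in the table above.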


Task 4

Description: Evaluating descriptive statistics for the variables “length” and “weight”

# descriptive statistics of variables length and weight

meanlength <- mean(speciesfisheries$length)
meanweight <- mean(speciesfisheries$weight)
medianlength <- median(speciesfisheries$length)
medianweight <- median(speciesfisheries$weight)
sdlength <- sd(speciesfisheries$length)
sdweight <- sd(speciesfisheries$weight)

# collect the statistics, column names, and row names

col_names = c("length", "weight")
row_names = c("Mean", "Median", "SD")
speciesfisheriesvector = c(meanlength, meanweight, medianlength, medianweight, sdlength, sdweight)

# creating the matrix, filled row by row so that each row holds one statistic

speciesfisheriestable = matrix(speciesfisheriesvector, nrow = 3, ncol = 2, byrow = TRUE, dimnames = list(row_names, col_names))

# using kable() to present the table
kable(speciesfisheriestable, align = "c", digits = 2, format = "html")%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "black", table.envir = "table", protect_latex = TRUE)
length weight
Mean 160.52 129.32
Median 153.04 60.24
SD 100.29 167.57

Observation: The table presents the mean, median, and standard deviation (SD) of the variables “length” and “weight”. For instance, “length” has a mean of 160.52 and a median of 153.04, so the mean exceeds the median by about 7.48. For “weight”, the gap is much larger (mean 129.32 versus median 60.24), which suggests a right-skewed distribution.
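
The same three statistics can also be computed for several columns in one call. The sketch below uses sapply() on a hypothetical data frame, fish_sample, built from only the ten records shown in the Task 3 table, so its results differ from the full-dataset values above.

```r
# Hypothetical sample: the ten records shown in the Task 3 table
fish_sample <- data.frame(
  length = c(268.75, 298.38, 275.36, 154.17, 332.43,
             223.26, 86.37, 93.06, 82.16, 72.45),
  weight = c(276.28, 380.55, 260.68, 47.29, 580.88,
             114.20, 5.87, 8.16, 5.82, 3.04)
)

# Apply the same summary function to every column; the result is a
# matrix with one row per statistic and one column per variable
stats_table <- sapply(fish_sample, function(x) {
  c(Mean = mean(x), Median = median(x), SD = sd(x))
})
print(round(stats_table, 2))
```

The resulting matrix already carries row and column names, so it can be passed directly to kable().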



Task 5
Description: In order to select individual categories from the categorical variable, applying the filter() function

# filtering individual categories of the categorical variable "Species" and finding mean for the variable length and weight, respectively

blackc = speciesfisheries %>%
  filter(species=="Black Crappie")
    
meanlengthbc = mean(blackc$length)
meanweightbc = mean(blackc$weight)

Blueg = speciesfisheries %>%
  filter(species=="Bluegill")

meanlengthbg = mean(Blueg$length)
meanweightbg = mean(Blueg$weight)  

Bluntnosem = speciesfisheries %>%
  filter(species=="Bluntnose Minnow")

meanlengthbm = mean(Bluntnosem$length)  
meanweightbm = mean(Bluntnosem$weight)

Iowa = speciesfisheries %>%
  filter(species=="Iowa Darter")

meanlengthiw = mean(Iowa$length)
meanweightiw = mean(Iowa$weight)

Largemouth = speciesfisheries %>%
  filter(species=="Largemouth Bass")

meanlengthlb = mean(Largemouth$length)
meanweightlb = mean(Largemouth$weight)

Pumpkin = speciesfisheries %>%
  filter(species=="Pumpkinseed")

meanlengthps = mean(Pumpkin$length)
meanweightps = mean(Pumpkin$weight)

Yellowperch = speciesfisheries %>%
  filter(species=="Yellow Perch")

meanlengthyp = mean(Yellowperch$length)  
meanweightyp = mean(Yellowperch$weight)

# collect the means, column names, and row names

colnamesspecies = c("Avg length", "Avg weight")
rownamesspecies = c("Black Crappie", "Bluegill", "Bluntnose Minnow", "Iowa Darter", "Largemouth Bass", "Pumpkinseed", "Yellow Perch")
sevenspeciesvector = c(meanlengthbc, meanweightbc, meanlengthbg, meanweightbg, meanlengthbm, meanweightbm, meanlengthiw, meanweightiw, meanlengthlb, meanweightlb, meanlengthps, meanweightps, meanlengthyp, meanweightyp)

# creating the matrix from the vector, filled row by row so that each row holds one species
sevenspeciestable = matrix(sevenspeciesvector, nrow = 7, ncol = 2, byrow = TRUE, dimnames = list(rownamesspecies, colnamesspecies))

# using kable () to present the table
kable(sevenspeciestable, digits = 2, align = "c")%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "brown", table.envir = "table", protect_latex = TRUE)
Avg length Avg weight
Black Crappie 276.08 360.35
Bluegill 145.62 90.03
Bluntnose Minnow 64.19 3.03
Iowa Darter 49.43 1.88
Largemouth Bass 299.28 353.69
Pumpkinseed 135.08 99.44
Yellow Perch 190.29 107.49

Observation: The table shows the average length and weight for each category of the categorical variable “species”. “Largemouth Bass” has the highest average length (299.28), whereas “Iowa Darter” has the lowest (49.43). “Black Crappie” is the heaviest species on average (360.35), and “Iowa Darter” is the lightest (1.88).
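
The per-species means can also be obtained in a single step instead of one filter per species. The sketch below uses base R's aggregate(), a swapped-in technique rather than the filter() approach of this task, on a hypothetical data frame, fish_sample, that contains only the ten records shown in the Task 3 table. Its means therefore differ from the full-dataset table above.

```r
# Hypothetical sample: the ten records shown in the Task 3 table
fish_sample <- data.frame(
  species = rep(c("Black Crappie", "Yellow Perch"), each = 5),
  length  = c(268.75, 298.38, 275.36, 154.17, 332.43,
              223.26, 86.37, 93.06, 82.16, 72.45),
  weight  = c(276.28, 380.55, 260.68, 47.29, 580.88,
              114.20, 5.87, 8.16, 5.82, 3.04)
)

# One row per species, with the mean of length and weight in each group
species_means <- aggregate(cbind(length, weight) ~ species,
                           data = fish_sample, FUN = mean)
print(species_means)
```

With the full dataset, the same call would return one row for each of the seven species.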


Task 6
Description: Creating a table to present the frequencies, cumulative frequencies, probability, and cumulative probability of variable species.

Reference: Bluman, A. (2018)

#Creating a frequency table of the variable species
speciestable <- table(speciesfisheries$species)

#transposing the table and converting it to a data frame (columns Var1, Var2, Freq)
speciestabletp <- data.frame(t(speciestable))

#renaming the columns
speciestabletp <- speciestabletp %>% rename(Frequency = Freq)
speciestabletp <- speciestabletp %>% rename(speciesName = Var2)

#using mutate function to create new variables for cumulative frequencies, probabilities, and cumulative probability of variable species

speciestabletp <- speciestabletp %>%
  mutate(
    Cumulativefreq = cumsum(Frequency),
    Probability = Frequency / sum(Frequency),
    CumulativeProbability = cumsum(Probability)
  )

#dropping the placeholder column Var1 (always 'A') created by the transposition
speciestabletp <- subset(speciestabletp, speciestabletp$Var1 == 'A', select = -c(Var1))

# using kable() to present the table

kable(speciestabletp, align = "c", digits = 2)%>%
  kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "green", table.envir = "table", protect_latex = TRUE)
speciesName Frequency Cumulativefreq Probability CumulativeProbability
Black Crappie 25 25 0.05 0.05
Bluegill 208 233 0.41 0.46
Bluntnose Minnow 100 333 0.20 0.66
Iowa Darter 31 364 0.06 0.72
Largemouth Bass 90 454 0.18 0.90
Pumpkinseed 13 467 0.03 0.92
Yellow Perch 38 505 0.08 1.00

Observation: Based on the analysis in Task 6, we are able to evaluate the frequencies, cumulative frequencies, probabilities, and cumulative probabilities of each category of the variable species. This analysis answers questions such as which species is least or most likely to occur: the probability of Pumpkinseed is 0.03 (lowest), whereas the probability of Bluegill is 0.41 (highest). Correspondingly, Pumpkinseed and Bluegill have the lowest and highest frequencies (13 and 208), respectively.
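
The arithmetic behind the table can be checked with a short base R sketch that starts from the frequencies reported above rather than from the raw dataset. The object names species_counts and freq_table are illustrative only.

```r
# Frequencies as reported in the Task 6 table
species_counts <- c("Black Crappie" = 25, "Bluegill" = 208,
                    "Bluntnose Minnow" = 100, "Iowa Darter" = 31,
                    "Largemouth Bass" = 90, "Pumpkinseed" = 13,
                    "Yellow Perch" = 38)

freq_table <- data.frame(
  speciesName = names(species_counts),
  Frequency   = as.vector(species_counts)
)

# Running total of counts, share of the total, and running total of shares
freq_table$Cumulativefreq        <- cumsum(freq_table$Frequency)
freq_table$Probability           <- freq_table$Frequency / sum(freq_table$Frequency)
freq_table$CumulativeProbability <- cumsum(freq_table$Probability)

print(freq_table, digits = 2)
```

Each probability is the frequency divided by the total of 505, for example 208 / 505 ≈ 0.41 for Bluegill.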


Task 7
Description: Presenting a pie chart to display probability and a bar plot to display cumulative probability

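# par(mfrow = c(1, 2)) only affects base R graphics, not ggplot2, so it is omitted; the two ggplot charts are drawn one after the other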

# Creating pie chart of probability data
ggplot(speciestabletp, aes(x = "", y = Probability, fill = speciesName)) +
  geom_bar(stat = "identity", width = 0.5, color = "white") +
  coord_polar("y", start = 0) +
  labs(fill = "Species Name") +
  ggtitle("Pie Chart - Probability of Species and its categories") +
  # theme_minimal() is a complete theme and resets earlier theme() settings,
  # so the legend position is set after it
  theme_minimal() +
  theme(legend.position = "bottom")

# Create bar plot of cumulative probability data
ggplot(speciestabletp, aes(x = speciesName, y = CumulativeProbability)) +
  geom_bar(stat = "identity", fill = "orange") +
  ggtitle("Bar Plot - Cumulative Probability of categorical variable Species") +
  xlab("Species Name") +
  ylab("Cumulative Probability") +
  theme_minimal()

Observation:
From the above graphical presentation, we are able to see that the probabilities and cumulative probabilities differ for each category. For instance, “Yellow Perch” has the highest cumulative probability in the bar plot (1.00), whereas its probability in the pie chart (0.08) is not the highest. This is because the cumulative probability is a running total over the categories in alphabetical order, so the last category always reaches 1.00 regardless of its own probability.
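
To make the effect of category order concrete, the following base R sketch recomputes the cumulative probabilities from the Task 6 frequencies in two different orders. Sorting by descending frequency is an illustrative alternative and is not the ordering used in the report.

```r
# Frequencies as reported in the Task 6 table (alphabetical order)
species_counts <- c("Black Crappie" = 25, "Bluegill" = 208,
                    "Bluntnose Minnow" = 100, "Iowa Darter" = 31,
                    "Largemouth Bass" = 90, "Pumpkinseed" = 13,
                    "Yellow Perch" = 38)

prob <- species_counts / sum(species_counts)

# Running totals in alphabetical order (as in the bar plot)
cum_alphabetical <- cumsum(prob)

# Running totals after sorting from most to least frequent
cum_by_frequency <- cumsum(sort(prob, decreasing = TRUE))

print(round(cum_alphabetical, 2))
print(round(cum_by_frequency, 2))
```

In both orderings the final bar equals 1, but the category that reaches it changes, so the height of a cumulative bar reflects position in the ordering rather than how common that species is.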

Conclusion:
In this executive summary report, we analyzed the descriptive statistics of the dataset and of the individual categories of the categorical variable species, and presented the results in tables and graphs. The frequency table makes it easy to read off the frequencies, probabilities, and cumulative probabilities of a specific category.

Learnings:
I have learnt how to use inline R code, how to evaluate the probabilities and cumulative probabilities of a categorical variable, and how to use ggplot to present graphs.

References:
1. Chiluiza, D. (2023). RPubs. URL: https://rpubs.com/Dee_Chiluiza
2. Bluman, A. (2018). Elementary Statistics: A Step by Step Approach. In Bluman, A., Frequency Distributions and Graphs (pp. 47-51).


Appendix:
An R Markdown file has been attached to this report. The name of the file is “M3Project_Rmarkdown.rmd”