Executive Summary Report
INTRODUCTION:
This executive summary report analyzes the “speciesfisheries” dataset.
The dataset contains 505 observations of 7 fish species, with the length
and weight of each fish. The report presents basic descriptive
statistics, the average length and weight of each species, and the
frequencies, probabilities, and cumulative probabilities of the variable
species. A pie chart displays the probabilities and a bar plot displays
the cumulative probabilities. The analysis draws on the textbook by
Bluman, A (2018) and the RPubs blog by Prof. Chiluiza, D (2023).
ANALYSIS SECTION
Task 1
Task 1A
Description: Using summary() to obtain descriptive statistics for the whole “speciesfisheries” dataset and presenting the result with kable().
| netID | fishID | species | length | weight | scale |
|---|---|---|---|---|---|
| Min. : 4.00 | Min. : 7.0 | Length:505 | Min. : 30.47 | Min. : 0.9027 | Mode :logical |
| 1st Qu.: 12.00 | 1st Qu.:169.0 | Class :character | 1st Qu.: 63.90 | 1st Qu.: 3.0711 | FALSE:194 |
| Median :101.00 | Median :569.0 | Mode :character | Median :153.04 | Median : 60.2374 | TRUE :311 |
| Mean : 78.68 | Mean :487.5 | NA | Mean :160.52 | Mean : 129.3169 | NA |
| 3rd Qu.:113.00 | 3rd Qu.:762.0 | NA | 3rd Qu.:230.11 | 3rd Qu.: 193.4106 | NA |
| Max. :206.00 | Max. :915.0 | NA | Max. :432.58 | Max. :1071.8813 | NA |
Observation: The summary shows descriptive statistics
for each variable in the dataset. For the character variable species, it
reports only the length (505), class, and mode; for the logical variable
“scale”, it counts 311 TRUE and 194 FALSE values. For the variable
“length”, the minimum is 30.47, the maximum is 432.58, the mean is
160.52, and the median is 153.04. For the variable “weight”, the minimum
is 0.90, the maximum is 1071.88, the median is 60.24, and the mean is
129.32. These basic observations can be read directly from the output of
the summary function, which makes it a quick and informative first step.
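As a minimal sketch of the step described in Task 1A, the block below passes the output of summary() to kable(). The data frame `toy` and its values are hypothetical stand-ins for “speciesfisheries”, and the knitr package is assumed to be installed.

```r
library(knitr)

# Hypothetical toy data standing in for "speciesfisheries"
toy <- data.frame(
  species = c("Bluegill", "Bluegill", "Yellow Perch", "Iowa Darter"),
  length  = c(145.2, 150.8, 190.1, 49.4),
  weight  = c(88.5, 92.3, 107.0, 1.9),
  scale   = c(TRUE, TRUE, TRUE, FALSE)
)

# summary() returns a character table with one column per variable
toy_summary <- summary(toy)

# kable() renders that table; NA cells pad the columns that have fewer rows
kable(toy_summary, align = "c")
```

The NA cells in the rendered table are padding, not missing data: character and logical columns produce fewer summary rows than numeric ones.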
Task 1B
Description: Applying the glimpse() function to inspect the structure of the dataset
#applying glimpse function
glimpse(speciesfisheries)
## Rows: 505
## Columns: 6
## $ netID <dbl> 5, 16, 16, 21, 24, 24, 101, 101, 101, 101, 102, 102, 102, 102,…
## $ fishID <dbl> 137, 208, 209, 218, 268, 269, 532, 534, 535, 537, 626, 627, 62…
## $ species <chr> "Black Crappie", "Black Crappie", "Black Crappie", "Black Crap…
## $ length <dbl> 268.75365, 298.37579, 275.35851, 154.17039, 332.43289, 309.698…
## $ weight <dbl> 276.278409, 380.552210, 260.684962, 47.293176, 580.883012, 441…
## $ scale <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
Observation: The glimpse output shows the structure of the dataset.
Compared with the usual layout, the display is transposed: the column
names are written down the page, each followed by its data type and its
first few values. For instance, the ‘netID’ variable starts with the
values 5, 16, 16, and so on. The output also reports the number of rows
as 505 (total observations) and the number of columns as 6.
Task 1C
Comparison of the Task 1A and 1B results:
Tasks 1A and 1B give two different views of the same data: summary() reports descriptive statistics for each variable, whereas glimpse() shows the raw values. The two outputs agree on the size of the dataset (505 rows and 6 columns). Because glimpse() shows only the data type and the first few values of each variable, it cannot be used to compare statistics with summary(). However, it does reveal the structure of the dataset.
Task 2
Description: Using inline R code
to present the number of columns and rows in the dataset
#To calculate the number of rows and columns, creating two objects
speciesfisheriescol <- ncol(speciesfisheries)
speciesfisheriesrow <- nrow(speciesfisheries)
How many variables (columns) does the data set contain? Columns = 6
How many observations (rows) does the data set contain? Rows = 505
Observation: Inline R code inserts the computed value directly into the text of the report, so the reported numbers update automatically whenever the data change.
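The sketch below illustrates the idea behind Task 2 on a hypothetical data frame `toy`. The object names `n_cols` and `n_rows` are illustrative, and the inline syntax is shown only as a comment because it belongs in the text of the .Rmd file rather than in a code chunk.

```r
# Hypothetical toy data standing in for "speciesfisheries"
toy <- data.frame(
  species = c("Bluegill", "Yellow Perch", "Iowa Darter"),
  length  = c(145.2, 190.1, 49.4),
  weight  = c(88.5, 107.0, 1.9)
)

# Store the dimensions in objects, as in the report
n_cols <- ncol(toy)
n_rows <- nrow(toy)

# In the .Rmd text, the objects would be referenced inline, for example:
#   Columns = `r n_cols`, Rows = `r n_rows`
# The same sentence can be previewed in the console with sprintf():
msg <- sprintf("Columns = %d, Rows = %d", n_cols, n_rows)
print(msg)
```

Because the sentence is built from the objects rather than typed by hand, adding a row to `toy` changes the reported count on the next knit.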
Task 3
Description: Selecting a few variables
(columns) to perform data analysis.
#isolating variables species, length, weight and then using headtail code to present first 5 and last 5 records of the selected variables
speciesfisheries %>%
select(species, length, weight)%>%
headtail(n=5)%>%
kable(align = "c", digits = 2)%>%
kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "grey", table.envir = "table", protect_latex = TRUE)
| | species | length | weight |
|---|---|---|---|
| 1 | Black Crappie | 268.75 | 276.28 |
| 2 | Black Crappie | 298.38 | 380.55 |
| 3 | Black Crappie | 275.36 | 260.68 |
| 4 | Black Crappie | 154.17 | 47.29 |
| 5 | Black Crappie | 332.43 | 580.88 |
| 501 | Yellow Perch | 223.26 | 114.20 |
| 502 | Yellow Perch | 86.37 | 5.87 |
| 503 | Yellow Perch | 93.06 | 8.16 |
| 504 | Yellow Perch | 82.16 | 5.82 |
| 505 | Yellow Perch | 72.45 | 3.04 |
Observation: The table shows the variables species, length, and weight selected from the “speciesfisheries” dataset, with the first 5 and last 5 records returned by headtail(). The output matches the raw data. In these ten records, the Black Crappie are generally longer and heavier than the Yellow Perch, although one Yellow Perch reaches a length of 223.26 and a weight of 114.20. Ten records are too few to generalize about either species.
Task 4
Description: Evaluating descriptive statistics for the variables “length” and “weight”
# descriptive statistics of variables length and weight
meanlength <- mean(speciesfisheries$length)
meanweight <- mean(speciesfisheries$weight)
medianlength <- median(speciesfisheries$length)
medianweight <- median(speciesfisheries$weight)
sdlength <- sd(speciesfisheries$length)
sdweight <- sd(speciesfisheries$weight)
# create vectors for the data, column names, and row names
col_names = c("length", "weight")
row_names = c("Mean", "Median", "SD")
speciesfisheriesvector = matrix(c(meanlength, meanweight, medianlength, medianweight, sdlength, sdweight), nrow = 3, byrow = TRUE)
# creating matrix
speciesfisheriestable = matrix(speciesfisheriesvector, ncol = 2, dimnames = list(row_names,col_names))
# using kable() to present the table
kable(speciesfisheriestable, align = "c", digits = 2, format = "html")%>%
kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "black", table.envir = "table", protect_latex = TRUE)
| | length | weight |
|---|---|---|
| Mean | 160.52 | 129.32 |
| Median | 153.04 | 60.24 |
| SD | 100.29 | 167.57 |
Observation: The table presents the mean, median, and standard deviation of the variables “length” and “weight”. For “length”, the mean is 160.52 and the median is 153.04, so the mean exceeds the median by about 7.48. For “weight”, the gap is much larger (mean 129.32 against median 60.24), which suggests a right-skewed distribution.
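A more compact way to build the same kind of table is sketched below. It is an alternative to the report's code, not a description of it: sapply() applies one function to every column and returns the statistics as a matrix. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy data with the two numeric variables of interest
toy <- data.frame(
  length = c(50, 100, 150, 300),
  weight = c(2, 10, 60, 400)
)

# One row per statistic, one column per variable
stats <- sapply(toy, function(x) c(Mean = mean(x), Median = median(x), SD = sd(x)))
print(round(stats, 2))

# Mean minus median: a positive gap hints at a right-skewed variable
gap <- stats["Mean", ] - stats["Median", ]
print(gap)
```

Applied to the values in the report, the same subtraction gives 160.52 - 153.04 = 7.48 for length and 129.32 - 60.24 = 69.08 for weight.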
Task 5
Description: Applying the filter
function to select individual categories of the categorical variable
species
# filtering individual categories of the categorical variable "Species" and finding mean for the variable length and weight, respectively
blackc = speciesfisheries %>%
filter(species=="Black Crappie")
meanlengthbc = mean(blackc$length)
meanweightbc = mean(blackc$weight)
Blueg = speciesfisheries %>%
filter(species=="Bluegill")
meanlengthbg = mean(Blueg$length)
meanweightbg = mean(Blueg$weight)
Bluntnosem = speciesfisheries %>%
filter(species=="Bluntnose Minnow")
meanlengthbm = mean(Bluntnosem$length)
meanweightbm = mean(Bluntnosem$weight)
Iowa = speciesfisheries %>%
filter(species=="Iowa Darter")
meanlengthiw = mean(Iowa$length)
meanweightiw = mean(Iowa$weight)
Largemouth = speciesfisheries %>%
filter(species=="Largemouth Bass")
meanlengthlb = mean(Largemouth$length)
meanweightlb = mean(Largemouth$weight)
Pumpkin = speciesfisheries %>%
filter(species=="Pumpkinseed")
meanlengthps = mean(Pumpkin$length)
meanweightps = mean(Pumpkin$weight)
Yellowperch = speciesfisheries %>%
filter(species=="Yellow Perch")
meanlengthyp = mean(Yellowperch$length)
meanweightyp = mean(Yellowperch$weight)
# create vectors for the data, column names, and row names
colnamesspecies = c("Avg length", "Avg weight")
rownamesspecies = c("Black Crappie", "Bluegill", "Bluntnose Minnow", "Iowa Darter", "Largemouth Bass", "Pumpkinseed", "Yellow Perch")
sevenspeciesvector = matrix(c(meanlengthbc, meanweightbc, meanlengthbg, meanweightbg, meanlengthbm, meanweightbm, meanlengthiw, meanweightiw, meanlengthlb, meanweightlb, meanlengthps, meanweightps, meanlengthyp, meanweightyp), nrow = 7, byrow = TRUE)
# creating matrix from the vector
sevenspeciestable = matrix(sevenspeciesvector, ncol = 2, dimnames = list(rownamesspecies,colnamesspecies))
# using kable () to present the table
kable(sevenspeciestable, digits = 2, align = "c")%>%
kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "brown", table.envir = "table", protect_latex = TRUE)
| | Avg length | Avg weight |
|---|---|---|
| Black Crappie | 276.08 | 360.35 |
| Bluegill | 145.62 | 90.03 |
| Bluntnose Minnow | 64.19 | 3.03 |
| Iowa Darter | 49.43 | 1.88 |
| Largemouth Bass | 299.28 | 353.69 |
| Pumpkinseed | 135.08 | 99.44 |
| Yellow Perch | 190.29 | 107.49 |
Observation: The table gives the average length and weight for each category of the categorical variable “species”. “Largemouth Bass” has the highest average length (299.28), whereas “Iowa Darter” has the lowest (49.43). Black Crappie is the heaviest species on average (360.35), and “Iowa Darter” is the lightest (1.88).
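The seven filter() blocks in Task 5 can also be expressed as a single grouped summary. The sketch below is an alternative formulation, not the report's code. It assumes the dplyr package is installed and uses a hypothetical data frame `toy` with the same column names as the report.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the report
toy <- data.frame(
  species = c("Bluegill", "Bluegill", "Iowa Darter", "Iowa Darter", "Yellow Perch"),
  length  = c(140, 150, 48, 52, 190),
  weight  = c(85, 95, 1.5, 2.5, 107)
)

# One row per species, with the mean of each numeric variable
species_means <- toy %>%
  group_by(species) %>%
  summarise(
    avg_length = mean(length),
    avg_weight = mean(weight)
  )

print(species_means)
```

The grouped version needs no hand-typed species names, so it keeps working if a species is added to or removed from the data.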
Task 6
Description: Creating a table to present
the frequencies, cumulative frequencies, probability, and cumulative
probability of variable species.
Reference: Bluman, A (2018)
#Creating a name for the table
speciestable <- table(speciesfisheries$species)
#transposing the table
speciestabletp <- data.frame(t(speciestable))
#renaming the column
speciestabletp <- speciestabletp %>% rename(Frequency = Freq)
speciestabletp <- speciestabletp %>% rename(speciesName = Var2)
#using mutate function to create new variables for cumulative frequencies, probabilities, and cumulative probability of variable species
speciestabletp <- speciestabletp %>%
mutate(
Cumulativefreq = cumsum(Frequency),
Probability = Frequency / sum(Frequency),
CumulativeProbability = cumsum(Probability)
)
speciestabletp <- subset(speciestabletp, speciestabletp$Var1 == 'A', select = -c(Var1))
# using kable() to present the table
kable(speciestabletp, align = "c", digits = 2)%>%
kable_styling(bootstrap_options = "basic", full_width = NULL, stripe_color = "green", table.envir = "table", protect_latex = TRUE)
| speciesName | Frequency | Cumulativefreq | Probability | CumulativeProbability |
|---|---|---|---|---|
| Black Crappie | 25 | 25 | 0.05 | 0.05 |
| Bluegill | 208 | 233 | 0.41 | 0.46 |
| Bluntnose Minnow | 100 | 333 | 0.20 | 0.66 |
| Iowa Darter | 31 | 364 | 0.06 | 0.72 |
| Largemouth Bass | 90 | 454 | 0.18 | 0.90 |
| Pumpkinseed | 13 | 467 | 0.03 | 0.92 |
| Yellow Perch | 38 | 505 | 0.08 | 1.00 |
Observation: The table gives the frequency, cumulative frequency, probability, and cumulative probability of each category of the variable species. It answers questions such as which species is least or most likely to be observed: Pumpkinseed has the lowest probability (0.03), whereas Bluegill has the highest (0.41). Correspondingly, Pumpkinseed has the lowest frequency (13) and Bluegill the highest (208).
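The construction of the frequency table can be sketched in base R as follows. This is a simplified variant that skips the transposition step used in the report; the vector `species` is hypothetical, while the column names mirror those in the table above.

```r
# Hypothetical vector of species labels
species <- c("Bluegill", "Bluegill", "Bluegill", "Iowa Darter", "Yellow Perch")

# table() counts each category; as.data.frame() turns the counts into columns
freq <- as.data.frame(table(species), stringsAsFactors = FALSE)
names(freq) <- c("speciesName", "Frequency")

# Running totals and relative frequencies
freq$Cumulativefreq        <- cumsum(freq$Frequency)
freq$Probability           <- freq$Frequency / sum(freq$Frequency)
freq$CumulativeProbability <- cumsum(freq$Probability)

print(freq)
```

Each probability is the category's frequency divided by the total count, so the probabilities sum to 1 and the last cumulative probability is always 1.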
Task 7
Description: Presenting pie chart to
display probability, and bar plot to display cumulative probability
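# Note: par(mfrow = c(1, 2)) applies to base graphics only and has no effect on ggplot output, so each plot below is drawn on its own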
# Creating pie chart of probability data
ggplot(speciestabletp, aes(x = "", y = Probability, fill = speciesName)) + geom_bar(stat = "identity", width = 0.5, color = "white") +
coord_polar("y", start=0) +
labs(fill = "Species Name") +
ggtitle("Pie Chart - Probability of Species and its categories") +
# theme_minimal() is a complete theme, so it comes first; otherwise it resets the legend position
theme_minimal() +
theme(legend.position = "bottom")
# Create bar plot of cumulative probability data
ggplot(speciestabletp, aes(x = speciesName, y = CumulativeProbability)) +
geom_bar(stat = "identity", fill = "orange") +
ggtitle("Bar Plot - Cumulative Probability of categorical variable Species") +
xlab("Species Name") +
ylab("Cumulative Probability") +
theme_minimal()
Observation:
The pie chart shows how the probabilities differ across
species, while the bar plot shows the cumulative probabilities rising
from one species to the next. For instance, “Yellow Perch” has the
highest cumulative probability in the bar plot (1.00) simply because it
is the last category in alphabetical order, whereas its probability in
the pie chart (0.08) is not the highest.
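The effect of category order on the cumulative probabilities can be checked with a short base R sketch. The frequencies are taken from the Task 6 table; the object names are illustrative.

```r
# Frequencies taken from the Task 6 table
freq <- c("Black Crappie" = 25, "Bluegill" = 208, "Bluntnose Minnow" = 100,
          "Iowa Darter" = 31, "Largemouth Bass" = 90, "Pumpkinseed" = 13,
          "Yellow Perch" = 38)

prob <- freq / sum(freq)

# Alphabetical order, as in the report: the last bar belongs to Yellow Perch
cum_alpha <- cumsum(prob)

# Sorted by decreasing frequency: the last bar now belongs to Pumpkinseed
cum_sorted <- cumsum(sort(prob, decreasing = TRUE))

print(round(cum_alpha, 2))
print(round(cum_sorted, 2))
```

In both orderings the final cumulative value is 1, so the tallest bar identifies the last category in the chosen order rather than the most common species.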
Conclusion:
In this executive summary report, we have
analyzed the descriptive statistics of the dataset and of the individual
categories of the categorical variable species, and presented them in
tables and graphs. The frequency table makes it easy to read off the
frequency, probability, and cumulative probability of any species.
Learnings:
I have learnt how to use inline R code, how to
evaluate the probabilities and cumulative probabilities of a categorical
variable, and how to use ggplot to present graphs.
References:
1. Prof. Chiluiza, D (2023), RPubs. URL: https://rpubs.com/Dee_Chiluiza
2. Bluman, A (2018), Elementary Statistics: A Step by Step Approach. In Bluman, A, Frequency Distributions and Graphs (pp. 47-51)
Appendix:
An R Markdown file has been attached to
this report. The name of the file is “M3Project_Rmarkdown.rmd”