INTRODUCTION
Importance and Practical Applications of Descriptive Statistics
Descriptive statistics play a crucial role in summarizing and interpreting data. They provide a concise and meaningful overview of datasets, making it easier for individuals and organizations to make informed decisions.
Practical Applications:
Business: Descriptive statistics help businesses analyze sales data, customer demographics, and market trends to make marketing and product development decisions.
Healthcare: Medical professionals use descriptive statistics to summarize patient data, study disease patterns, and assess treatment outcomes.
Education: Educators use descriptive statistics to evaluate student performance, identify areas for improvement, and design effective teaching strategies.
Social Sciences: Researchers in sociology and psychology use descriptive statistics to analyze survey results, conduct experiments, and draw conclusions about human behavior.
Practical Applications of R for Data Analysis
R is a powerful open-source programming language and environment widely used for data analysis and statistical modeling.
Practical Applications:
Data Cleaning: R is used to clean and preprocess messy data, handling missing values, outliers, and inconsistencies.
Data Visualization: R offers numerous packages for creating insightful data visualizations, including scatter plots, histograms, and heat maps.
Statistical Analysis: Researchers and analysts use R for hypothesis testing, regression analysis, and complex statistical modeling.
Machine Learning: R provides libraries for machine learning tasks such as classification, clustering, and predictive modeling.
Data Reporting: R can generate automated reports with dynamic charts and tables, enhancing data-driven decision-making.
The wine industry covers the production of wine, encompassing vineyard cultivation, grape harvesting, fermentation, and bottling. Variables such as alcohol content and magnesium levels are essential for winemakers and wine enthusiasts:
Alcohol Content: Alcohol percentage is a critical factor influencing wine taste, mouthfeel, and quality. It affects the wine’s body, sweetness, and overall balance.
Magnesium: Magnesium levels in soil and grapes influence vine health, grape quality, and wine characteristics. Appropriate magnesium levels are essential for vine growth and grape development.
Importance of Analytics:
Quality Control: Analytics can help winemakers monitor and control alcohol levels during fermentation, ensuring the desired taste and style.
Soil Analysis: Analytics enable soil testing to determine magnesium and other nutrient levels, optimizing vineyard conditions.
Predictive Modeling: Data analytics can predict wine quality based on factors like alcohol and magnesium, guiding winemaking decisions.
ANALYSIS
# Libraries used
library(dplyr)
library(lattice)
library(kableExtra)
library(ggplot2)
library(readxl)
wineType <- read_excel("wineType.xlsx")
#Task 1
# Inspect the structure of the dataset with glimpse()
glimpse(wineType)
## Rows: 178
## Columns: 14
## $ Wine_Type <chr> "Region 1", "Region 1", "Region 1", "Re…
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.2…
## $ Malic_acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.8…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.4…
## $ Alcalinity_of_ash <dbl> 15.83, 11.58, 19.16, 17.61, 21.41, 15.8…
## $ Magnesium <dbl> 127.61, 100.48, 101.02, 113.96, 118.60,…
## $ Total_phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.5…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.5…
## $ Nonflavanoid_phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.3…
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.9…
## $ Color_intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.2…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.0…
## $ `OD280/OD315_of_diluted_wines` <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.5…
## $ Proline <dbl> 1065, 1050, 1185, 1480, 735, 1450, 1290…
Observations
Number of Observations and Variables: The first two lines of the output give the size of the dataset: 178 observations (rows) and 14 variables (columns).
Variable Names and Data Types: Each subsequent line describes one variable, giving its name and data type. Wine_Type is a character column (chr); the other 13 variables are numeric doubles (dbl).
Sample Data: Each line ends with the first few values of that variable, truncated to fit the output width, which gives a quick impression of the dataset’s content.
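The same structural facts can also be checked programmatically rather than read off the printed output. The sketch below uses base R only and a small made-up data frame; `toy_wine` and its values are hypothetical stand-ins, not the wineType data.

```r
# Hypothetical stand-in for the wine data: one character column, two numeric
toy_wine <- data.frame(
  Wine_Type = c("Region 1", "Region 2", "Region 3"),
  Alcohol   = c(14.2, 12.4, 13.1),
  Proline   = c(1065, 520, 630),
  stringsAsFactors = FALSE
)

# Number of rows and columns, as reported in the first lines of glimpse()
toy_dims <- dim(toy_wine)

# Storage class of every column: "character" corresponds to <chr>,
# "numeric" to <dbl> in the glimpse() output
toy_classes <- sapply(toy_wine, class)

print(toy_dims)
print(toy_classes)

# Base R counterpart of glimpse(): a compact structure listing
str(toy_wine)
```

Storing the dimensions and classes in objects makes it possible to assert them in later steps instead of checking them by eye.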
#Task 2
# Rename column and create a new dataset
wine_all <- wineType %>%
  rename(Region = Wine_Type)
# Display a glimpse of the new dataset
glimpse(wine_all)
## Rows: 178
## Columns: 14
## $ Region <chr> "Region 1", "Region 1", "Region 1", "Re…
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.2…
## $ Malic_acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.8…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.4…
## $ Alcalinity_of_ash <dbl> 15.83, 11.58, 19.16, 17.61, 21.41, 15.8…
## $ Magnesium <dbl> 127.61, 100.48, 101.02, 113.96, 118.60,…
## $ Total_phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.5…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.5…
## $ Nonflavanoid_phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.3…
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.9…
## $ Color_intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.2…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.0…
## $ `OD280/OD315_of_diluted_wines` <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.5…
## $ Proline <dbl> 1065, 1050, 1185, 1480, 735, 1450, 1290…
Observation
Using the dplyr package, I piped the original dataset wineType into rename(Region = Wine_Type) with the %>% operator, which renames the column “Wine_Type” to “Region”. The result is stored in a new dataset called wine_all, leaving wineType unchanged. I then used glimpse(wine_all) to display a glimpse of the new dataset, confirming that the column name has been corrected.
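As a minimal illustration of the renaming step, the sketch below applies the same rename() call to a hypothetical two-row data frame (`toy_original` is made up for this example) and shows that the input object keeps its old column name.

```r
library(dplyr)

# Hypothetical two-row data frame standing in for wineType
toy_original <- data.frame(Wine_Type = c("Region 1", "Region 2"),
                           Alcohol   = c(14.2, 12.4))

# rename() takes new_name = old_name and returns a modified copy
toy_renamed <- toy_original %>%
  rename(Region = Wine_Type)

names(toy_original)  # still contains "Wine_Type"
names(toy_renamed)   # now contains "Region" in the same position
```

Because rename() returns a copy, assigning the result to a new name keeps the raw data available for comparison.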
#Task 3
# Create a box plot of Alcohol distribution per Region
boxplot(Alcohol ~ Region, data = wine_all,
        main = "Alcohol Distribution per Region",
        xlab = "Region", ylab = "Alcohol Content")
Observation
Region 2 has two outliers, and its minimum value is below 10. Its quartiles (Q1, Q2, and Q3) all lie below its maximum (the fourth quartile), which is 12.
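The quantities read off a box plot can also be computed directly. The sketch below is an illustration on made-up numbers (`toy_alcohol` is hypothetical, not a column of wine_all); it uses base R's boxplot.stats(), which returns the values that boxplot() draws, and applies the usual 1.5 times box-height rule for outliers.

```r
# Hypothetical alcohol readings for one region (not taken from wine_all)
toy_alcohol <- c(11.0, 11.5, 11.8, 12.0, 12.2, 12.4, 12.6, 13.0, 15.5)

# boxplot.stats() returns the five numbers drawn by boxplot():
# lower whisker, lower hinge (about Q1), median,
# upper hinge (about Q3), upper whisker
toy_stats <- boxplot.stats(toy_alcohol)
toy_stats$stats

# Points farther than 1.5 times the box height from the box
# are drawn as individual outliers
box_height  <- toy_stats$stats[4] - toy_stats$stats[2]
upper_fence <- toy_stats$stats[4] + 1.5 * box_height
lower_fence <- toy_stats$stats[2] - 1.5 * box_height

# Values outside the fences; for this toy vector only 15.5
toy_stats$out
```

Running the same call per region, for example on a subset of the Alcohol column, would give exact values to quote alongside the plot.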
#Task 4
# Calculate the global mean Alcohol
global_mean_alcohol <- summarise(wine_all, Mean_Alcohol = mean(Alcohol))
# Display the value
global_mean_alcohol$Mean_Alcohol
## [1] 13.37792
#Task 5
# Create wine2 data with group by and summarise
wine2 <- wine_all %>%
  group_by(Region) %>%
  summarise(Mean_Alcohol = mean(Alcohol))
# Format the table
wine2_table <- wine2 %>%
  kable(align = c("l", "c"), table.attr = "style='width:40%;'") %>%
  kable_classic_2()
wine2_table
| Region | Mean_Alcohol |
|---|---|
| Region 1 | 13.74898 |
| Region 2 | 11.18469 |
| Region 3 | 16.16596 |
#Task 6
# Calculate the mean alcohol content per Region
mean_alcohol <- tapply(wine_all$Alcohol, wine_all$Region, mean)
# Create a table from the results
wine2_table <- data.frame(Region = names(mean_alcohol), Mean_Alcohol = mean_alcohol)
# Print the table
print(wine2_table)
## Region Mean_Alcohol
## Region 1 Region 1 13.74898
## Region 2 Region 2 11.18469
## Region 3 Region 3 16.16596
#Task 7
# Create a new data set wine3 with corrected region names
# recode() keys must match the existing labels exactly ("Region 1", not "1")
wine3 <- wine_all %>%
  mutate(Region = recode(Region,
                         `Region 1` = "California",
                         `Region 2` = "Colorado",
                         `Region 3` = "Massachusetts"))
#Task 8
# Create a table for the average Proline per Region
proline_summary <- wine3 %>%
  group_by(Region) %>%
  summarise(Avg_Proline = mean(Proline, na.rm = TRUE)) %>%
  mutate(Region = factor(Region, levels = c("California", "Colorado", "Massachusetts")))
# Format the table with kable
proline_table <- proline_summary %>%
  kable("html", digits = 1, caption = "Average Proline per Region", align = "c") %>%
  kable_styling("striped")
# Print the table
proline_table
| Region | Avg_Proline |
|---|---|
| California | 1115.7 |
| Colorado | 519.5 |
| Massachusetts | 629.9 |
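How recode() and factor() treat labels that do not match is worth checking on a small example, because a mismatch fails silently. The sketch below uses a hypothetical vector (`toy_region`) in the same style as the Region column; it assumes dplyr's recode() leaves unmatched character values unchanged, which is its documented default behaviour.

```r
library(dplyr)

# Hypothetical labels in the same style as the Region column
toy_region <- c("Region 1", "Region 2", "Region 3")

# Keys that match no existing value leave a character vector unchanged
unmatched <- recode(toy_region,
                    `1` = "California",
                    `2` = "Colorado",
                    `3` = "Massachusetts")

# Keys that match the full labels replace them
matched <- recode(toy_region,
                  `Region 1` = "California",
                  `Region 2` = "Colorado",
                  `Region 3` = "Massachusetts")

# factor() turns every value missing from `levels` into NA
state_levels <- c("California", "Colorado", "Massachusetts")
factor(unmatched, levels = state_levels)  # all NA
factor(matched, levels = state_levels)    # three valid levels
```

A quick unique() on the recoded column is a cheap way to confirm that the new labels are in place before building tables on them.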
# Create histograms for Proline per region
histogram(~ Proline | Region, data = wine3, main = "Proline Distribution by Region", xlab = "Proline", layout = c(1, 3))
#Task 9
# Count observations per category in Region
region_counts <- wine3 %>%
  count(Region) %>%
  as.matrix() %>%
  kable("html", caption = "Observations per Category in Region", align = "c") %>%
  kable_styling("striped")
# Print the table
region_counts
| Region | n |
|---|---|
| California | 59 |
| Colorado | 71 |
| Massachusetts | 48 |
#Task 10
# Create three objects for magnesium classes
magnesium_class1 <- wine3 %>%
filter(Magnesium >= min(Magnesium) & Magnesium <100)
magnesium_class2 <- wine3 %>%
filter(Magnesium >= 100 & Magnesium<130)
magnesium_class3 <- wine3 %>%
filter(Magnesium >=130)
# Set options to replace NA with '0'
options(knitr.kable.NA = '0')
# Create tables for each magnesium class
table1 <- table(magnesium_class1$Region)
table2 <- table(magnesium_class2$Region)
table3 <- table(magnesium_class3$Region)
# Combine tables into one data frame
# table() names its category column Var1, so rename it to Region
combined_table <- bind_rows(
  data.frame(Class = "Class 1", table1),
  data.frame(Class = "Class 2", table2),
  data.frame(Class = "Class 3", table3)
) %>%
  rename(Region = Var1)
# Improve table presentation using kable
formatted_table <- kable(combined_table, align = "c") %>%
  kable_classic_2(full_width = FALSE)
# Draw a pie chart
pie_chart <- ggplot(combined_table, aes(x = "", y = Freq, fill = Region)) +
  geom_bar(stat = "identity") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Regions in Magnesium Classes")
# Display the formatted table and pie chart
formatted_table
| Class | Region | Freq |
|---|---|---|
| Class 1 | California | 16 |
| Class 1 | Colorado | 53 |
| Class 1 | Massachusetts | 28 |
| Class 2 | California | 42 |
| Class 2 | Colorado | 13 |
| Class 2 | Massachusetts | 20 |
| Class 3 | California | 1 |
| Class 3 | Colorado | 5 |
pie_chart
#Task 11
# Create tables for each magnesium class
table_class1 <- table(magnesium_class1$Region)
table_class2 <- table(magnesium_class2$Region)
table_class3 <- table(magnesium_class3$Region)
# Bind all three tables together and rename table()'s default Var1 column
combined_tables <- bind_rows(
  data.frame(Class = "Low", table_class1),
  data.frame(Class = "Medium", table_class2),
  data.frame(Class = "High", table_class3)
) %>%
  rename(Region = Var1)
# Transform to a data frame
combined_df <- as.data.frame(combined_tables)
# Set NA values to empty space
options(knitr.kable.NA = '')
# Print the improved table
kable(combined_df, format = "pipe", caption = "Observations per Magnesium Class")
| Class | Region | Freq |
|---|---|---|
| Low | California | 16 |
| Low | Colorado | 53 |
| Low | Massachusetts | 28 |
| Medium | California | 42 |
| Medium | Colorado | 13 |
| Medium | Massachusetts | 20 |
| High | California | 1 |
| High | Colorado | 5 |
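The three filter() calls and the manual binding of tables can also be expressed as a single binning step. The sketch below is an alternative technique, not the approach used above: it uses base R's cut() and table() on made-up values (`toy_magnesium` and `toy_region` are hypothetical) with the same class boundaries of 100 and 130.

```r
# Hypothetical magnesium readings and regions (not taken from wine3)
toy_magnesium <- c(85, 99.9, 100, 129, 130, 151)
toy_region <- factor(
  c("California", "Colorado", "California",
    "Massachusetts", "Colorado", "Colorado"),
  levels = c("California", "Colorado", "Massachusetts")
)

# right = FALSE makes each interval closed on the left, matching the
# filters above: below 100, from 100 to below 130, and 130 or more
toy_class <- cut(toy_magnesium,
                 breaks = c(-Inf, 100, 130, Inf),
                 labels = c("Low", "Medium", "High"),
                 right = FALSE)

# Cross-tabulate class against region; empty combinations appear as 0
toy_counts <- table(Class = toy_class, Region = toy_region)
toy_counts

# Long format with one row per Class/Region pair
toy_long <- as.data.frame(toy_counts)
toy_long
```

Because the region is stored as a factor, a class with no observations for a region still gets a row with a count of 0 instead of being dropped from the table.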
The code provided in the previous sections aimed to analyse a dataset related to wine regions and characteristics. The following is a summary of the results and observations obtained from the code:
Data Loading and Preliminary Exploration:
The dataset was loaded and the Wine_Type column was renamed to Region to correct the variable name.
A glimpse of the dataset was taken to understand its structure.
Data Cleaning:
Region labels were corrected from the generic values Region 1, Region 2, and Region 3 to their respective state names (California, Colorado, and Massachusetts).
Descriptive Statistics and Visualisation:
Box plots were used to visualise the distribution of alcohol content per region.
A table was created to present the average Proline content per region.
Histograms were generated to visualise Proline distribution by region.
Frequency Analysis:
The count of observations per region category was computed and presented in a table.
Magnesium Class Analysis:
The dataset was filtered into three classes based on magnesium content.
Tables were created to count the observations per region within each magnesium class.
All three tables were combined and presented.
Project Direction and Observations:
The analysis provided insights into the distribution of alcohol and proline content across different wine regions.
Differences in alcohol content were visualised using box plots.
The average Proline content varied by region and was presented in a table.
The proline distribution was visualised using histograms, which showed potential differences between regions.
The frequency analysis highlighted the number of observations per region category.
The magnesium class analysis showed variations in observations within each class.
New skills gained:
Improved data manipulation skills using dplyr functions.
Enhanced data visualisation and table presentation using the lattice, ggplot2, and kableExtra libraries.
Efficient handling of data summarisation and presentation.
Recommendations:
Further analysis could explore relationships between wine characteristics and regions.
Identifying outliers and their impact on wine quality could be valuable.
Additional variables related to wine quality could be considered for analysis.
Strategy for Organisation:
The data was loaded and explored step by step.
Every analysis task was divided into manageable portions.
Using the dplyr and kableExtra libraries consistently ensured effective presentation and manipulation of data.
Documentation and comments were used to explain each code section.
The analysis provided valuable insights into wine characteristics across different regions. Further investigation and research may lead to a deeper understanding and could even serve as a guide for decision-making in the wine industry or research institutions. The project required abilities in data manipulation, systematic organisation, and efficient visualisation techniques.