ALY 6000
Northeastern University
Oladimeji Adegoke
Project 4 Report
Instructor: Prof. Dee Chiluiza

INTRODUCTION

Importance and Practical Applications of Descriptive Statistics

Descriptive statistics play a crucial role in summarizing and interpreting data. They provide a concise and meaningful overview of datasets, making it easier for individuals and organizations to make informed decisions.

Practical Applications:

Business: Descriptive statistics help businesses analyze sales data, customer demographics, and market trends to make marketing and product development decisions.

Healthcare: Medical professionals use descriptive statistics to summarize patient data, study disease patterns, and assess treatment outcomes.

Education: Educators use descriptive statistics to evaluate student performance, identify areas for improvement, and design effective teaching strategies.

Social Sciences: Researchers in sociology and psychology use descriptive statistics to analyze survey results, conduct experiments, and draw conclusions about human behavior.

References

Trochim, W. M. (2006). Descriptive statistics. Research methods knowledge base.
Everitt, B. S., & Skrondal, A. (2010). The Cambridge Dictionary of Statistics. Cambridge University Press.

Practical Applications of R for Data Analysis

R is a powerful open-source programming language and environment widely used for data analysis and statistical modeling.

Practical Applications:

Data Cleaning: R is used to clean and preprocess messy data, handling missing values, outliers, and inconsistencies.

Data Visualization: R offers numerous packages for creating insightful data visualizations, including scatter plots, histograms, and heat maps.

Statistical Analysis: Researchers and analysts use R for hypothesis testing, regression analysis, and complex statistical modeling.

Machine Learning: R provides libraries for machine learning tasks such as classification, clustering, and predictive modeling.

Data Reporting: R can generate automated reports with dynamic charts and tables, enhancing data-driven decision-making.

References

Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly Media.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.

Basic Information on the Winery Industry and the Importance of Variables Alcohol and Magnesium

The winery industry involves the production of wine, encompassing vineyard cultivation, grape harvesting, fermentation, and bottling. Variables such as alcohol content and magnesium levels are essential for winemakers and wine enthusiasts:

Alcohol Content: Alcohol percentage is a critical factor influencing wine taste, mouth feel, and quality. It affects the wine’s body, sweetness, and overall balance.

Magnesium: Magnesium levels in soil and grapes influence vine health, grape quality, and wine characteristics. Appropriate magnesium levels are essential for vine growth and grape development.

Importance of Analytics:

Quality Control: Analytics can help winemakers monitor and control alcohol levels during fermentation, ensuring the desired taste and style.

Soil Analysis: Analytics enable soil testing to determine magnesium and other nutrient levels, optimizing vineyard conditions.

Predictive Modeling: Data analytics can predict wine quality based on factors like alcohol and magnesium, guiding winemaking decisions.

References

Jackson, R. S. (2008). Wine Science: Principles and Applications. Academic Press.
Hardie, W. J., & Considine, J. A. (1976). Magnesium nutrition of grapevines. Journal of the American Society for Horticultural Science, 101(4), 411-415.

ANALYSIS

# Libraries used
library(dplyr)
library(lattice)
library(kableExtra)
library(ggplot2)
library(readxl)

# Load the wine dataset from the Excel workbook
wineType <- read_excel("wineType.xlsx")

# Task 1
# Use glimpse() to inspect the structure of the dataset
glimpse(wineType)

## Rows: 178
## Columns: 14
## $ Wine_Type <chr> "Region 1", "Region 1", "Region 1", "Re…
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.2…
## $ Malic_acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.8…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.4…
## $ Alcalinity_of_ash <dbl> 15.83, 11.58, 19.16, 17.61, 21.41, 15.8…
## $ Magnesium <dbl> 127.61, 100.48, 101.02, 113.96, 118.60,…
## $ Total_phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.5…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.5…
## $ Nonflavanoid_phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.3…
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.9…
## $ Color_intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.2…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.0…
## $ `OD280/OD315_of_diluted_wines` <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.5…
## $ Proline <dbl> 1065, 1050, 1185, 1480, 735, 1450, 1290…
Observations

Number of Observations and Variables: The first two lines of the output show that the dataset has 178 observations (rows) and 14 variables (columns).

Variable Names and Data Types: Each subsequent line describes one variable, giving its name and data type. Here Wine_Type is a character column (chr) and the other 13 variables are doubles (dbl).

Sample Data: Each line ends with the first few values of that variable, which gives a quick impression of the dataset’s content.

# Task 2
# Rename column and create a new dataset
wine_all <- wineType %>%
  rename(Region = Wine_Type)

# Display a glimpse of the new dataset
glimpse(wine_all)

## Rows: 178
## Columns: 14
## $ Region <chr> "Region 1", "Region 1", "Region 1", "Re…
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.2…
## $ Malic_acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.8…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.4…
## $ Alcalinity_of_ash <dbl> 15.83, 11.58, 19.16, 17.61, 21.41, 15.8…
## $ Magnesium <dbl> 127.61, 100.48, 101.02, 113.96, 118.60,…
## $ Total_phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.5…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.5…
## $ Nonflavanoid_phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.3…
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.9…
## $ Color_intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.2…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.0…
## $ `OD280/OD315_of_diluted_wines` <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.5…
## $ Proline <dbl> 1065, 1050, 1185, 1480, 735, 1450, 1290…

Observation

With the dplyr package, I used the %>% pipe operator to pass the original dataset, wineType, to rename(). The call rename(Region = Wine_Type) renames the column “Wine_Type” to “Region”, and the result is stored in a new dataset called wine_all. I then used glimpse(wine_all) to display the new dataset and confirm that the column name had been corrected.
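For comparison, the same rename can be done without dplyr. The sketch below uses base R on a small made-up data frame (the `toy` object and its values are hypothetical, not the wine data). It only illustrates the idea and is not the method used in this report.

```r
# Hypothetical stand-in for wineType (made-up values, not the wine data)
toy <- data.frame(
  Wine_Type = c("Region 1", "Region 2", "Region 3"),
  Alcohol   = c(14.2, 12.4, 13.1)
)

# Copy first so the original object stays unchanged,
# mirroring how wineType is kept and wine_all is created
toy_renamed <- toy
names(toy_renamed)[names(toy_renamed) == "Wine_Type"] <- "Region"

names(toy)          # still contains "Wine_Type"
names(toy_renamed)  # contains "Region" in its place
```

Both approaches leave the data values untouched and change only the column label.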

# Task 3
# Create a box plot of Alcohol distribution per Region
boxplot(Alcohol ~ Region,
        data = wine_all,
        main = "Alcohol Distribution per Region",
        xlab = "Region",
        ylab = "Alcohol Content")

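Observation

Region 2 has two outliers, and its minimum value is below 10. Its first, second, and third quartiles (Q1, Q2, and Q3) all fall below 12, which marks the upper end of its distribution (the fourth quartile).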
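The quartiles and outliers that a box plot draws can also be read off numerically. The sketch below is a minimal illustration on made-up alcohol readings (the `alcohol` vector is hypothetical, not the Region 2 data). It uses boxplot.stats(), which applies the same 1.5 × IQR rule that boxplot() uses by default.

```r
# Hypothetical alcohol readings for one region (made-up values)
alcohol <- c(9.5, 10.9, 11.0, 11.2, 11.3, 11.5, 11.8, 12.0, 13.9)

# boxplot.stats() returns the numbers behind the plot
stats <- boxplot.stats(alcohol)
stats$stats  # lower whisker, lower hinge (Q1), median, upper hinge (Q3), upper whisker
stats$out    # values drawn as individual outlier points

# The same rule by hand: points beyond 1.5 * IQR from the hinges are outliers
five        <- fivenum(alcohol)
iqr         <- five[4] - five[2]
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr
manual_out  <- alcohol[alcohol < lower_fence | alcohol > upper_fence]
```

Running the same call on one region's Alcohol values would give exact figures to quote instead of values estimated by eye from the plot.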

# Task 4
# Calculate the global mean Alcohol
global_mean_alcohol <- summarise(wine_all, Mean_Alcohol = mean(Alcohol))

# Display the value
global_mean_alcohol$Mean_Alcohol

## [1] 13.37792

# Task 5
# Create wine2 data with group_by() and summarise()
wine2 <- wine_all %>%
  group_by(Region) %>%
  summarise(Mean_Alcohol = mean(Alcohol))

# Format the table
wine2_table <- wine2 %>%
  kable(align = c("l", "c"), table.attr = "style='width:40%;'") %>%
  kable_classic_2()
wine2_table

Region	Mean_Alcohol
Region 1	13.74898
Region 2	11.18469
Region 3	16.16596

# Task 6
# Calculate the mean alcohol content per Region
mean_alcohol <- tapply(wine_all$Alcohol, wine_all$Region, mean)

# Create a table from the results
wine2_table <- data.frame(Region = names(mean_alcohol), Mean_Alcohol = mean_alcohol)

# Print the table
print(wine2_table)

##            Region Mean_Alcohol
## Region 1 Region 1     13.74898
## Region 2 Region 2     11.18469
## Region 3 Region 3     16.16596
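Tasks 5 and 6 compute the same per-region means in two different ways. The sketch below shows that equivalence in base R on a small made-up data frame (the `toy` object is hypothetical), with aggregate() standing in for the dplyr group_by()/summarise() pipeline. It also passes row.names = NULL, which avoids the repeated region label that appears as row names in the printed output above.

```r
# Hypothetical data (made-up values, not the wine data)
toy <- data.frame(
  Region  = c("Region 1", "Region 1", "Region 2", "Region 2", "Region 3"),
  Alcohol = c(14, 13, 12, 11, 13.5)
)

# Approach 1: tapply() returns a named vector of group means
by_tapply <- tapply(toy$Alcohol, toy$Region, mean)

# Approach 2: aggregate() returns a data frame with one row per group
by_aggregate <- aggregate(Alcohol ~ Region, data = toy, FUN = mean)

# Turn the tapply() result into a table without duplicated row names
tidy <- data.frame(
  Region       = names(by_tapply),
  Mean_Alcohol = as.numeric(by_tapply),
  row.names    = NULL
)
tidy
```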

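# Task 7
# Create a new dataset wine3 with corrected region names.
# The Region column holds the strings "Region 1", "Region 2", and "Region 3",
# so the recode() keys must match those strings exactly; keys such as `1`
# match nothing and leave the column unchanged.
wine3 <- wine_all %>%
  mutate(Region = recode(Region,
                         `Region 1` = "California",
                         `Region 2` = "Colorado",
                         `Region 3` = "Massachusetts"))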

# Task 8
# Create a table for the average Proline per Region
proline_summary <- wine3 %>%
  group_by(Region) %>%
  summarise(Avg_Proline = mean(Proline, na.rm = TRUE)) %>%
  mutate(Region = factor(Region, levels = c("California", "Colorado", "Massachusetts")))

# Format the table with kable (the argument name is digits)
proline_table <- proline_summary %>%
  kable("html", digits = 1, caption = "Average Proline per Region", align = "c") %>%
  kable_styling("striped")

# Print the table
proline_table

Average Proline per Region
Region	Avg_Proline
NA	1115.7
NA	519.5
NA	629.9
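NA labels like those in the Region column above typically appear when factor() is given levels that do not occur in the data. The sketch below is a minimal base-R illustration on a made-up vector. The `lookup` vector is a hypothetical helper, and its region-to-state mapping follows the one intended in Task 7.

```r
# Region labels as they appear in the data
region <- c("Region 1", "Region 2", "Region 3")

# Levels that never occur in `region`: every value becomes NA
mismatched <- factor(region, levels = c("California", "Colorado", "Massachusetts"))
mismatched

# Hypothetical lookup vector: map each label to a state name first
lookup <- c("Region 1" = "California",
            "Region 2" = "Colorado",
            "Region 3" = "Massachusetts")
renamed <- unname(lookup[region])

# Now the values match the levels and no NA is produced
matched <- factor(renamed, levels = c("California", "Colorado", "Massachusetts"))
matched
```

A quick check such as anyNA(matched) after the conversion would flag this kind of mismatch before the table is built.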

# Create histograms for Proline per region
histogram(~ Proline | Region,
          data = wine3,
          main = "Proline Distribution by Region",
          xlab = "Proline",
          layout = c(1, 3))

# Task 9
# Count observations per category in Region
region_counts <- wine3 %>%
  count(Region) %>%
  as.matrix() %>%
  kable("html", caption = "Observations per Category in Region", align = "c") %>%
  kable_styling("striped")

# Print the table
region_counts

Observations per Category in Region
Region	n
Region 1	59
Region 2	71
Region 3	48
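As a quick consistency check, the global mean from Task 4 should equal the average of the regional means from Task 5, weighted by these counts. The sketch below re-enters the figures reported in this document. The object names are my own and are not part of the original analysis.

```r
# Regional mean alcohol (Task 5) and observations per region (Task 9)
region_means  <- c(13.74898, 11.18469, 16.16596)
region_counts <- c(59, 71, 48)

# The counts should add up to the 178 rows reported by glimpse()
total_n <- sum(region_counts)

# Size-weighted average of the regional means
weighted_mean <- sum(region_means * region_counts) / total_n
weighted_mean  # should be close to the global mean of 13.37792 from Task 4
```

The unweighted average of the three regional means would differ because the regions have different numbers of observations.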

# Task 10
# Create three objects for magnesium classes
magnesium_class1 <- wine3 %>% filter(Magnesium >= min(Magnesium) & Magnesium < 100)
magnesium_class2 <- wine3 %>% filter(Magnesium >= 100 & Magnesium < 130)
magnesium_class3 <- wine3 %>% filter(Magnesium >= 130)

# Set options to replace NA with '0'
options(knitr.kable.NA = '0')

# Create tables for each magnesium class
table1 <- table(magnesium_class1$Region)
table2 <- table(magnesium_class2$Region)
table3 <- table(magnesium_class3$Region)

# Combine tables into one data frame
combined_table <- bind_rows(
  data.frame(Class = "Class 1", table1),
  data.frame(Class = "Class 2", table2),
  data.frame(Class = "Class 3", table3)
)

# Improve table presentation using kable
formatted_table <- kable(combined_table, align = "c") %>%
  kable_classic_2(full_width = FALSE)

# Build a pie chart; table() stores the region names in the Var1 column,
# so the fill aesthetic must refer to Var1
pie_chart <- ggplot(combined_table, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Regions in Magnesium Classes", fill = "Region")

# Display the formatted table
formatted_table

Class	Var1	Freq
Class 1	Region 1	16
Class 1	Region 2	53
Class 1	Region 3	28
Class 2	Region 1	42
Class 2	Region 2	13
Class 2	Region 3	20
Class 3	Region 1	1
Class 3	Region 2	5

# Task 11
# Create tables for each magnesium class
table_class1 <- table(magnesium_class1$Region)
table_class2 <- table(magnesium_class2$Region)
table_class3 <- table(magnesium_class3$Region)

# Bind all three tables together
combined_tables <- bind_rows(
  data.frame(Class = "Low", table_class1),
  data.frame(Class = "Medium", table_class2),
  data.frame(Class = "High", table_class3)
)

# Transform to a data frame
combined_df <- as.data.frame(combined_tables)

# Set NA values to empty space
options(knitr.kable.NA = '')

# Print the improved table
kable(combined_df, format = "pipe", caption = "observation Magnesium Classes")

observation Magnesium Classes
Class	Var1	Freq
Low	Region 1	16
Low	Region 2	53
Low	Region 3	28
Medium	Region 1	42
Medium	Region 2	13
Medium	Region 3	20
High	Region 1	1
High	Region 2	5
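The three filter() calls and the bound tables can also be expressed as a single binning step followed by a cross-tabulation. The sketch below is one possible alternative, not the method used above. It uses base R's cut() and table() on made-up data (the `toy` data frame is hypothetical), with the same cut-offs of 100 and 130. Unlike the bound tables, the cross-tabulation also lists combinations with zero observations, such as the High class for Region 3.

```r
# Hypothetical data (made-up values, not the wine data)
toy <- data.frame(
  Region    = c("Region 1", "Region 1", "Region 2", "Region 2", "Region 3", "Region 3"),
  Magnesium = c(95, 127.6, 88, 135, 100, 99.9)
)

# right = FALSE gives the intervals [-Inf, 100), [100, 130), [130, Inf),
# matching the conditions < 100, >= 100 & < 130, and >= 130
toy$Mg_Class <- cut(
  toy$Magnesium,
  breaks = c(-Inf, 100, 130, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE
)

# Cross-tabulate class by region; empty combinations appear as 0
counts <- table(toy$Mg_Class, toy$Region)
counts

# Long format with descriptive column names instead of Var1
counts_long <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_long) <- c("Class", "Region", "Freq")
counts_long
```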

CONCLUSION

The code provided in the previous sections aimed to analyse a dataset related to wine regions and characteristics. The following is a summary of the results and observations obtained from the code:

Data Loading and Preliminary Exploration:

The dataset was loaded, and the Wine_Type column was renamed to Region to correct the variable name.

A glimpse of the dataset was taken to understand its structure.

Data Cleaning:

Region names were recoded from the generic labels (Region 1, Region 2, and Region 3) to their respective state names (California, Colorado, and Massachusetts).

Descriptive Statistics and Visualisation:

Box plots were used to visualise the distribution of alcohol content per region.

A table was created to present the average Proline content per region.

Histograms were generated to visualize Proline distribution by region.

Frequency Analysis:

The count of observations per region category was computed and presented in a table.

Magnesium Class Analysis:

The dataset was filtered into three classes based on magnesium content.

Tables were created to count the observations per region within each magnesium class.

All three tables were combined and presented.

Project Direction and Observations:

The analysis provided insights into the distribution of alcohol and proline content across different wine regions.

Differences in alcohol content were visualised using box plots.

The average Proline content varied by region and was presented in a table.

The Proline distribution was visualised using histograms, which showed potential differences between regions.

The frequency analysis highlighted the number of observations per region category.

The magnesium class analysis showed variations in observations within each class.

New skills gained:

Improved data manipulation skills using dplyr functions.

Enhanced data visualisation and table presentation using the lattice and kableExtra libraries.

More efficient handling of data summarisation and presentation.

Recommendations:

Further analysis could explore relationships between wine characteristics and regions.

Identifying outliers and their impact on wine quality could be valuable.

Additional variables related to wine quality could be considered for analysis.

Strategy for Organisation:

The data was loaded and explored step by step.

Every analysis task was divided into manageable portions.

Using the dplyr and kableExtra libraries consistently ensured effective presentation and manipulation of data.

Documentation and comments were used to explain each code section.

The analysis provided valuable insights into wine characteristics across different regions. Further investigation and research may lead to a deeper understanding and could even guide decision-making in the wine industry or research institutions. The project required skills in data manipulation, systematic organisation, and efficient visualisation techniques.

REFERENCES

Jackson, R. S. (2008). Wine Science: Principles and Applications. Academic Press.
Hardie, W. J., & Considine, J. A. (1976). Magnesium nutrition of grapevines. Journal of the American Society for Horticultural Science, 101(4), 411-415.
Trochim, W. M. (2006). Descriptive statistics. Research methods knowledge base.
Everitt, B. S., & Skrondal, A. (2010). The Cambridge Dictionary of Statistics. Cambridge University Press.
Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly Media.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
Chiluiza, D. (2021). Introduction to data analysis using R, R Studio and R Markdown. https://rpubs.com/Dee_Chiluiza/816756