MH3511 PROJECT DESCRIPTION

📊 Individual Project: Data Analysis Report and Presentation

This project contributes 20% to your final course grade. You will work individually on this task.

🎯 Objective

Choose a publicly available dataset and apply the statistical and data analysis techniques learned in this course. The project is designed to assess your ability to handle real-world data, conduct appropriate statistical analysis, and communicate your findings effectively.

🧭 Project Tasks

You are required to:

1. Import and prepare the data in R (e.g., using readr::read_csv() or readxl::read_excel()). This includes:
   - Classifying variables into categorical (nominal), ordinal, numeric interval, or numeric ratio scales.
   - Cleaning the data, such as:
     - Handling or removing missing values when appropriate.
     - Identifying and removing duplicates.
     - Merging redundant or inconsistent factor levels where necessary for analysis.
2. Generate appropriate summary statistics (e.g., using summary(), skimr::skim(), or dplyr summaries). This step should also involve identifying potential outliers and choosing an appropriate strategy for handling them.
3. Create a meaningful visualization using ggplot2 or another visualization library. Make sure to:
   - Clearly communicate the insights from the plot.
   - Address any outliers appropriately in the visual display.
4. Perform a goodness-of-fit test to evaluate whether a theoretical distribution fits the data (e.g., Kolmogorov–Smirnov test or chi-squared test). A QQ plot is a useful visual check alongside the test, but it is not a test in itself.
5. Conduct an appropriate statistical test (e.g., chi-squared test, t-test, Mann–Whitney U test, ANOVA, regression, or correlation) and interpret the result in a real-world context.
6. Write a concise report (maximum 4 pages, not including code) using R Markdown, clearly explaining your methodology, analysis, and conclusions. Your report should be written in clear English with full sentences and a logical structure.
7. Prepare a short presentation (maximum 4 slides) summarizing your findings, created in R Markdown (e.g., xaringan, ioslides, or beamer). You will present it to the class in no more than 5 minutes. The presentation should be engaging and well-paced, and it should demonstrate your understanding. You may be asked questions by your peers and the instructor.
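
The import, cleaning, summary, visualization, and testing tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a template to copy: the built-in `airquality` dataset stands in for a public dataset you would import yourself, the 1.5 × IQR rule is one of several possible outlier strategies, and the Shapiro–Wilk and Kruskal–Wallis tests are choices made for this example rather than tests prescribed by the project.

```r
library(dplyr)
library(ggplot2)

# Stand-in for the import step (assumption): a real project would read a
# public file instead, e.g. with readr::read_csv() on your own data source.
raw <- as_tibble(datasets::airquality)

# Cleaning: drop exact duplicate rows, drop rows with a missing response,
# and recode Month (stored as an integer) as an ordered factor.
clean <- raw %>%
  distinct() %>%
  filter(!is.na(Ozone)) %>%
  mutate(Month = factor(Month, levels = 5:9,
                        labels = month.abb[5:9], ordered = TRUE))

# Outliers: flag (rather than delete) values outside the 1.5 * IQR fences.
limits <- quantile(clean$Ozone, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(clean$Ozone)
clean <- clean %>%
  mutate(outlier = Ozone < limits[1] | Ozone > limits[2])

# Summary statistics per month.
ozone_summary <- clean %>%
  group_by(Month) %>%
  summarise(n = n(),
            mean_ozone = mean(Ozone),
            median_ozone = median(Ozone),
            sd_ozone = sd(Ozone),
            n_outliers = sum(outlier),
            .groups = "drop")

# Visualization: boxplots with the raw points overlaid, outliers highlighted.
p <- ggplot(clean, aes(x = Month, y = Ozone)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(colour = outlier), width = 0.15, height = 0) +
  labs(title = "Daily ozone by month",
       x = "Month", y = "Ozone (ppb)", colour = "Flagged outlier")

# Goodness of fit: is log(Ozone) compatible with a normal distribution?
gof <- shapiro.test(log(clean$Ozone))

# Statistical test: do ozone levels differ between months?
kw <- kruskal.test(Ozone ~ Month, data = clean)
```

In a report, `ozone_summary` would be rendered as a formatted table, `p` printed as a figure, and the p-values in `gof` and `kw` interpreted in the context of the data rather than shown as raw output.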

🗓️ Timeline

You will work on your project throughout the course. In each class, you will have the opportunity to apply the skills learned that week to your project.

Presentations will take place in Class 9 on Wednesday, 16 July.
Your final report is due by Thursday, 17 July.

Individual Project Rubric (20% of Final Grade)

Criteria	Excellent (100%)	Good (75%)	Satisfactory (50%)	Poor (25%)	Unsatisfactory (0%)	Weight
1. Data Import	Data is correctly imported from a public source using R; code is clean and reproducible.	Data is imported with minor issues or excessive code complexity.	Data is imported but with help or manual intervention.	Attempt made but not working correctly.	No attempt or completely incorrect.	10%
2. Summary Statistics	Includes appropriate numeric summaries; well-formatted and meaningful.	Basic summaries are included but may lack completeness or formatting.	Some summary stats present but minimal or poorly presented.	Attempted but mostly incorrect.	Not attempted.	10%
3. Visualization	Relevant, correctly implemented ggplot2 visualizations; clearly interpreted in the report.	Visualization is relevant but lacks explanation or polish.	Basic plot is shown without interpretation.	Poor or incorrect use of visualizations.	Not attempted.	10%
4. Goodness of Fit Test	Appropriate theoretical distribution tested; test implemented correctly; result clearly interpreted.	Test implemented with minor issues or weak interpretation.	Correct test used but poorly explained.	Attempted but wrong test or flawed implementation.	Not attempted.	15%
5. Statistical Test	Suitable test chosen and explained (e.g., t-test or ANOVA); real-world conclusion is correct.	Test mostly correct but interpretation is unclear or incomplete.	Basic test done with weak link to context.	Attempted but test choice or conclusion is wrong.	Not attempted.	15%
6. Report	Clear, concise, structured, in full sentences; fits 4-page limit and includes all sections.	Mostly clear but has structural or language issues.	Understandable but poorly structured or verbose.	Hard to follow or poorly written.	Missing or completely incoherent.	10%
7. Presentation	Within 5 minutes; engaging; clear slides; speaker answers questions well.	Mostly clear; minor issues with timing or Q&A.	Presentation is basic or lacks clarity.	Poorly presented or hard to understand.	Not presented.	10%
8. Extras	Project includes additional material beyond the scope of the course (e.g., networks, power laws, or an idea of your own).	Attempt at additional material is relevant but shallow.	Some creativity or extras, but not well executed.	Extras attempted but irrelevant or poorly done.	No extras.	20%
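
As a worked illustration of how the rubric weights combine, the sketch below computes a project mark for one hypothetical student. It assumes the mark is the weighted average of the per-criterion levels, which the rubric implies but does not state explicitly. The scores are invented for the example.

```r
# Rubric weights in percent, in the order of criteria 1-8.
weights <- c(import = 10, summary = 10, visualization = 10,
             goodness_of_fit = 15, statistical_test = 15,
             report = 10, presentation = 10, extras = 20)

# Hypothetical per-criterion levels (100, 75, 50, 25, or 0) for one student.
scores <- c(import = 100, summary = 100, visualization = 75,
            goodness_of_fit = 100, statistical_test = 75,
            report = 100, presentation = 75, extras = 0)

# Assumed aggregation: weighted average of the criterion levels.
project_mark <- sum(weights * scores) / sum(weights)
project_mark  # 71.25

# Contribution to the final course grade, given the project's 20% share.
course_points <- 0.20 * project_mark
course_points  # 14.25
```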

Additional Notes

R Programming Quality: The quality of R programming will influence scores across all criteria. For example:
- Using unnecessary loops may reduce the score for Data Import or Summary Statistics.
- Avoiding tidyverse tools without reason or using awkward base R constructions may reduce the score for Visualization.
- Using raw R output instead of nicely formatted tables (e.g., kable, gt, flextable) may reduce the score for the Report.
Use of Generative AI Tools: Students are allowed to use generative AI tools (e.g., ChatGPT, Copilot) to:
- Improve grammar and writing style.
- Help write clean and idiomatic R code.
- Generate boilerplate text or code snippets.
However, all use of AI tools must be declared in the final submission. Students remain fully responsible for the accuracy, coherence, and originality of their work.
Grading Expectations:
- Items 1–7 are core competencies. Most students are expected to score well here unless they significantly misunderstand or omit requirements.
- Item 8 (Extras) is intentionally more challenging. Only a few students are expected to achieve full marks for this component.
- Even if a student scores 0% on Extras, they can still achieve an A+ (85%) by performing well on the rest.
Peer Grading:
- A simple peer assessment will be conducted during the presentations using a 1–5 Likert scale.
- Peer feedback will be considered as part of the Presentation score, especially for engagement and clarity.
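
To make the R Programming Quality note above more concrete, here is a small sketch contrasting a loop-based summary with an equivalent dplyr pipeline, followed by a formatted table. The built-in `iris` dataset and the variable names are chosen only for illustration. The loop is not wrong, but it is the more verbose style that the note cautions against.

```r
library(dplyr)

# Loop-based group means: works, but is verbose and easy to get wrong.
loop_means <- numeric(0)
for (sp in levels(iris$Species)) {
  loop_means[sp] <- mean(iris$Sepal.Length[iris$Species == sp])
}

# The same summary as a single dplyr pipeline.
tidy_means <- iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length), .groups = "drop")

# A formatted table for the report instead of raw console output.
knitr::kable(tidy_means, digits = 2,
             caption = "Mean sepal length by species")
```

Both approaches give the same numbers; the pipeline version is shorter and feeds directly into a table function such as `knitr::kable()`.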

Declaration of AI usage

Fedor used ChatGPT 4o to generate this document using his prompts. Specifically, Fedor outlined the main structure — the scope of the project, the list of tasks, the list of grading rubrics and their weights, and the list of additional notes. ChatGPT was then used to improve grammar and clarity, to fill the rubric matrix with specific descriptions, and to format everything in markdown.