This project contributes 20% to your final course grade. You will work individually on this task.
Choose a publicly available dataset and apply the statistical and data analysis techniques learned in this course. The project is designed to assess your ability to handle real-world data, conduct appropriate statistical analysis, and communicate your findings effectively.
You are required to:
Import and prepare the data in R (e.g., using `read_csv()`, `read_excel()`, etc.).
This includes:
Generate appropriate summary statistics (e.g., using `summary()`, `skimr::skim()`, or `dplyr` summaries). This step should also involve identifying potential outliers and choosing an appropriate strategy for handling them.
Create a meaningful visualization using `ggplot2` or other visualization libraries.
Make sure to:
Perform a goodness-of-fit test to evaluate whether a theoretical distribution fits the data (e.g., QQ plot, Kolmogorov–Smirnov test, chi-squared test).
Conduct an appropriate statistical test (e.g., chi-squared test, t-test, Mann–Whitney U test, ANOVA, regression, correlation, etc.) and interpret the result in a real-world context.
Write a concise report (maximum 4 pages, not including code) using R Markdown, clearly explaining your methodology, analysis, and conclusions. Your report should be written in clear English with full sentences and logical structure.
Prepare a short presentation (maximum 4 slides) summarizing your findings, created in R Markdown (e.g., `xaringan`, `ioslides`, or `beamer`). You will present this to the class. The presentation should be engaging and well-paced (no more than 5 minutes), and it should demonstrate your understanding. You may be asked questions by peers and the instructor.
You will work on your project throughout the course. In each class, you will have the opportunity to apply the skills learned that week to your project.
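As a rough illustration of how the analysis steps listed above might fit together, the sketch below uses only base R. The built-in `ToothGrowth` dataset stands in for a public dataset, and names such as `df`, `group_means`, and `is_outlier` are placeholders chosen for this example, not part of the project requirements. The 1.5 × IQR rule, the normal distribution, and Welch's t-test are only one possible set of choices; your own data may call for different ones.

```r
# Stand-in data: the built-in ToothGrowth dataset (an assumption made for
# illustration; a real project would import a public dataset instead).
data(ToothGrowth)
df <- ToothGrowth

# --- Summary statistics --------------------------------------------------
print(summary(df$len))                          # five-number summary and mean
group_means <- tapply(df$len, df$supp, mean)    # mean tooth length per supplement
print(group_means)

# --- Outlier screening with the 1.5 * IQR rule ---------------------------
q <- unname(quantile(df$len, probs = c(0.25, 0.75)))
fence <- 1.5 * IQR(df$len)
is_outlier <- df$len < q[1] - fence | df$len > q[2] + fence
n_outliers <- sum(is_outlier)
cat("Observations flagged as potential outliers:", n_outliers, "\n")

# --- Goodness of fit: is tooth length plausibly normal? -------------------
# Graphical check: points close to the reference line suggest normality.
qqnorm(df$len)
qqline(df$len)

# Kolmogorov-Smirnov test against a normal distribution.
# Caveats: the mean and sd are estimated from the same sample, which makes
# the p-value approximate (conservative), and R warns about ties because
# the test assumes a continuous distribution.
gof <- ks.test(df$len, "pnorm", mean = mean(df$len), sd = sd(df$len))
print(gof)

# --- Statistical test: do the two supplement groups differ? ---------------
# Welch two-sample t-test (t.test() does not assume equal variances by default).
tt <- t.test(len ~ supp, data = df)
print(tt)

# The interpretation should refer to the real-world question, for example
# whether the delivery method is associated with a difference in mean
# tooth length, and not only to the p-value.
```

In a report, each printed result would be accompanied by one or two sentences explaining what it means for the question being asked.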
Criteria | Excellent (100%) | Good (75%) | Satisfactory (50%) | Poor (25%) | Unsatisfactory (0%) | Weight |
---|---|---|---|---|---|---|
1. Data Import | Data is correctly imported from a public source using R; code is clean and reproducible. | Data is imported with minor issues or excessive code complexity. | Data is imported but with help or manual intervention. | Attempt made but not working correctly. | No attempt or completely incorrect. | 10% |
2. Summary Statistics | Includes appropriate numeric summaries; well-formatted and meaningful. | Basic summaries are included but may lack completeness or formatting. | Some summary stats present but minimal or poorly presented. | Attempted but mostly incorrect. | Not attempted. | 10% |
3. Visualization | Relevant, correctly implemented ggplot2 visualizations; clearly interpreted in the report. | Visualization is relevant but lacks explanation or polish. | Basic plot is shown without interpretation. | Poor or incorrect use of visualizations. | Not attempted. | 10% |
4. Goodness of Fit Test | Appropriate theoretical distribution tested; test implemented correctly; result clearly interpreted. | Test implemented with minor issues or weak interpretation. | Correct test used but poorly explained. | Attempted but wrong test or flawed implementation. | Not attempted. | 15% |
5. Statistical Test | Suitable test chosen and explained (e.g., t-test, ANOVA, etc.); real-world conclusion is correct. | Test mostly correct but interpretation is unclear or incomplete. | Basic test done with weak link to context. | Attempted but test choice or conclusion is wrong. | Not attempted. | 15% |
6. Report | Clear, concise, structured, in full sentences; fits 4-page limit and includes all sections. | Mostly clear but has structural or language issues. | Understandable but poorly structured or verbose. | Hard to follow or poorly written. | Missing or completely incoherent. | 10% |
7. Presentation | Within 5 minutes; engaging; clear slides; speaker answers questions well. | Mostly clear; minor issues with timing or Q&A. | Presentation is basic or lacks clarity. | Poorly presented or hard to understand. | Not presented. | 10% |
8. Extras | Project includes additional material outside the scope (e.g., networks, power law, own idea). | Attempt at additional material is relevant but shallow. | Some creativity or extras, not well executed. | Extras attempted but irrelevant or poorly done. | No extras. | 20% |
R Programming Quality: The quality of R programming will influence scores across all criteria. For example:
`kable`, `gt`, `flextable`) may reduce the score for the Report.
Use of Generative AI Tools: Students are allowed to use generative AI tools (e.g., ChatGPT, Copilot) to:
However, all use of AI tools must be declared in the final submission. Students remain fully responsible for the accuracy, coherence, and originality of their work.
Grading Expectations:
Peer Grading:
Fedor used ChatGPT 4o to generate this document using his prompts. Specifically, Fedor outlined the main structure — the scope of the project, the list of tasks, the list of grading rubrics and their weights, and the list of additional notes. ChatGPT was then used to improve grammar and clarity, to fill the rubric matrix with specific descriptions, and to format everything in markdown.