DS101 Fundamentals of Data Science – Formative Assignment 1

Author

Vansh Niraj Bijlani

Task 1: When numbers mislead

Q1: According to (Greenland et al. 2016), what are two common misinterpretations of p-values? Why do these misunderstandings persist in practice?

Statistical analysis is central to data science, since one of the core purposes of data science is to understand complex data sets in order to drive actionable insights and inform decision-making by identifying patterns and forming hypotheses from those patterns to apply to real-life issues (links to Week 1 Lecture). However, according to Greenland et al. (2016), one of the foundational tools we use, the p-value, is often misunderstood or misinterpreted, which can weaken the conclusions we draw. In data science we use test statistics such as a t-statistic or a Chi-squared statistic to measure how far the observed data lie from what our model predicts. To understand the misinterpretations of p-values, we first need the definition: Greenland et al. (2016) define the p-value as “the probability that the chosen test statistic (such as t-stat or Chi square) would have been at least as large as its observed value if every model assumption were correct, including the test hypothesis.” In simpler terms, a p-value tells us how surprising our data is if the null hypothesis (the hypothesis assuming no effect/difference) and all other model assumptions, e.g. random sampling and correct model specification (choosing the correct variables and functional form of the relationship), were true (links to Week 4 Lecture). A low p-value means that, if the null hypothesis (H₀) were true, the observed data would be unlikely, suggesting the results are inconsistent with H₀ and providing indirect evidence for a real effect. Conversely, a high p-value means the data is consistent with what we’d expect under H₀. Most importantly, the p-value does not tell us anything directly about whether the hypothesis itself is true or false, only how surprising the data is assuming H₀ is true.

This brings us to one of the most widespread and damaging misinterpretations of the p-value, which is that the “P value is the probability that the test hypothesis is true”. This misinterpretation leads researchers to conclude, for example, that a p-value of 0.01 means there is a 1% chance the null hypothesis is true, which is categorically false. The p-value is calculated under the assumption that the null hypothesis is true, so it cannot tell us the probability of that hypothesis being true or false. A good way to understand it is as the conditional probability of the data (the condition: holding H₀ to be true), and not as the probability of the null hypothesis given the data. To illustrate this, imagine flipping a coin assumed to be fair (50% chance of heads or tails) 10 times and observing 10 heads in a row. The p-value for this result under the null hypothesis is extremely low (about 0.001, since (1/2)^10 = 1/1024), yet this doesn’t mean the probability of the coin being fair is 0.1%. Rather, it only tells us how surprising our observed result would be if the coin were fair, as H₀ assumes. It doesn’t account for external factors such as experimental bias or selective reporting, which may affect real-world outcomes. Greenland et al. (2016) note that “it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis.” The correct interpretation (for p = 0.01) would be: holding the null hypothesis and all other assumptions to be true, there is a 1% chance of obtaining results at least as extreme as those I observed, not that there is a 1% chance the null is true. Another variation of this is that “chance alone produced the association”, which is equally wrong since the p-value measures how surprising the data is if chance were the only thing operating, and not how likely it is that chance caused the data.
This misunderstanding persists because, according to Greenland et al. (2016), there are “no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof”. In my understanding, scientists skip steps when drawing conclusions from p-values because of cognitive ease: they want to reason about hypotheses directly instead of working through the conditional and unintuitive statements that p-values actually provide. They confuse the direction of the relationship, which is that the p-value is the probability of the data given the null hypothesis and not the probability of the null hypothesis given the data.
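The coin example can be checked numerically. The sketch below is only an illustration: it computes the exact one-sided p-value for ten heads in ten flips, and then applies Bayes' rule to show that the probability the coin is fair depends on inputs the p-value never sees. The two priors and the "90% heads" alternative coin are hypothetical assumptions chosen for the illustration, not values from Greenland et al. (2016).

```python
from math import comb

def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value for k heads."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def posterior_fair(prior_fair: float, p_data_if_fair: float,
                   p_data_if_biased: float) -> float:
    """Bayes' rule for P(fair | data); needs a prior and an alternative."""
    numerator = prior_fair * p_data_if_fair
    return numerator / (numerator + (1 - prior_fair) * p_data_if_biased)

# P(data | H0): ten heads in ten flips of a fair coin, exactly 1/1024.
p_value = binomial_tail(10, 10)

# Assumed alternative: a coin that lands heads 90% of the time.
likelihood_biased = 0.9 ** 10

print("p-value:", p_value)
for prior in (0.5, 0.999):  # hypothetical prior beliefs that the coin is fair
    print("prior", prior, "-> P(fair | data):",
          posterior_fair(prior, p_value, likelihood_biased))
```

With the same data and the same p-value, the two priors give very different probabilities that the coin is fair, which is why the p-value cannot be read as the probability of the null hypothesis.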

Another fundamental misconception is that a non-significant result (p > 0.05) means there is no effect, or that the null hypothesis is true. Failing to reach “statistical significance” does not mean that there is “no effect.” It only means that the evidence isn’t strong enough (under this model) to confidently rule out chance as a factor, which could be because the study was underpowered, because the effect was small, or simply because of random error. Greenland et al. (2016) state that “Even if the test hypothesis is wrong, the P value may be large because it was inflated by a large random error or because of some other erroneous assumption.” This misconception persists due to binary thinking: researchers read p < 0.05 as showing that a real effect exists and p > 0.05 as showing that no real effect exists between the variables. I believe it also persists due to overreliance on arbitrary thresholds (such as the 0.05 cutoff). Scientific publishing, media, and educational practices (even in my own experience from university) frequently reinforce the idea that results must clear this threshold to be meaningful (if below 0.05 we reject the null, and if above we fail to reject the null), leading to findings being wrongly discarded and to underreporting of potentially important research. This oversimplified, rigid, all-or-nothing approach leaves no room for critical thinking about what the p-values actually mean. Greenland et al. (2016) note the same in their conclusion, saying “we join others in singling out the degradation of P values into “significant” and “nonsignificant” as an especially pernicious statistical practice”. This shows that scientific reporting has drifted toward obtaining p-values that appear to prove or disprove hypotheses instead of using the data to lead us to independent conclusions, which has eroded the validity of data-driven conclusions and has major long-term consequences.

While Greenland et al. identify 25 misinterpretations, I selected these two because they reflect core conceptual errors about what p-values are and how they should be interpreted, and because of the gravity of the errors they can cause. They also interact with one another: researchers who misinterpret non-significance at the 0.05 level as “no effect” are likely to report false negatives, and those who misinterpret small p-values as the probability that their hypothesis is true may inflate false positives. Greenland et al. (2016) urge us to abandon these fallacies and replace binary thinking with context-sensitive interpretation that considers effect sizes, confidence intervals, and the assumptions underlying models, and to be careful when comparing p-values to one another. While a low p-value may indicate that the data are surprising under the null hypothesis (H₀), it does not imply practical significance: p-values are sensitive to sample size, so a big sample can make small effects look statistically significant, while a small sample may fail to detect substantial effects due to low statistical power (small n).
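The sensitivity of p-values to sample size can be illustrated with a small calculation. This is a minimal sketch under simplifying assumptions that are mine, not the readings': a one-sample z-test with a known standard deviation, where `effect_size` is the observed mean expressed in standard-deviation units. The specific effect sizes and sample sizes are hypothetical.

```python
from math import erfc, sqrt

def two_sided_z_p(effect_size: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test.

    effect_size is the observed mean divided by a known standard deviation
    (a simplifying assumption), so the test statistic is z = effect_size * sqrt(n).
    """
    z = effect_size * sqrt(n)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return erfc(abs(z) / sqrt(2))

# A tiny effect measured on a very large sample.
small_effect_big_sample = two_sided_z_p(effect_size=0.05, n=10_000)

# A substantial effect measured on a very small sample.
large_effect_small_sample = two_sided_z_p(effect_size=0.5, n=10)

print("tiny effect, n = 10000:", small_effect_big_sample)
print("large effect, n = 10:", large_effect_small_sample)
```

Under these assumptions the tiny effect is "significant" at 0.05 while the effect ten times larger is not, so the p-value alone says little about how big or important an effect is.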

Q2: (Nuzzo 2014) argues that researchers often treat p-values as a “gold standard.” How does this mindset connect to the problems (Greenland et al. 2016) describes?

Nuzzo’s 2014 article “Scientific Method: Statistical Errors” is a cautionary examination of the p-value, which has long been treated as the “gold standard” of statistical inference and is fundamentally misunderstood and misapplied. Nuzzo warns that researchers “cling” to small p-values as if they guarantee that a hypothesis is true, and that these values are “not doing their job, because they can’t” (quoting Ziliak, an economist at Roosevelt University in Chicago). She critiques both the technical use of the p-value and the mindset of scientists who make the p-value a definitive arbiter of scientific discovery through dichotomous thinking. This directly links to the problems identified by Greenland et al. (2016), who argue that misinterpretations “remain rampant” because of “shortcut definitions and interpretations that are simply wrong, sometimes disastrously so.” Both sources point to the same core problem: treating p-values as objective indicators of truth, when in reality that reading is epistemologically flawed. An example is the case of Matt Motyl, which Nuzzo describes: Motyl’s initial research showed a statistically significant p-value of 0.01, but when the study was replicated with more data the p-value was 0.59, indicating that the original result could not be relied upon. If the original Motyl study had been published, it could have led to overconfident but misleading claims about political extremism (Motyl’s area of research). Greenland et al. (2016) expand on Nuzzo’s critique by mapping out 25 misinterpretations of p-values, but one that stands out is “The P value is the probability that the test hypothesis is true.” It parallels Nuzzo’s claim that scientists mistakenly read p-values as indicating the probability of truth.
In reality the p-value is computed assuming that every other assumption about how the data were generated is also true, not just the null hypothesis. However, these other assumptions are rarely scrutinized, which feeds this polarized thinking.

Nuzzo explains that the 0.05 threshold became popular because of an accidental blending of two different statistical ideas, one by Fisher and the other by Neyman and Pearson, which were introduced in the early to mid 1900s. Nuzzo notes that the dilution of their ideas, and the practice now called ‘p-hacking’, have become increasingly problematic. She says Neyman and Pearson noted that “The tests themselves give no final verdict, but as tools to help the worker who is using them to form his final decision” and Fisher notes “No scientific worker has a fixed level of significance.” These founding ideas were meant to be guidelines, with the p-value considered as one piece of evidence. Greenland et al. (2016) explain that selective, threshold-driven analysis of this kind (they don’t use the term p-hacking specifically but imply it) is an immense degradation of statistical reasoning into yes/no decisions. Both sources acknowledge that the misuse of p-values isn’t just a technical issue, since it’s reinforced by publication pressures and educational norms.

I would note that Nuzzo suggests it’s a cultural habit, viewing the problem as more behavioral and psychological, whereas Greenland et al. propose reform of statistical practice and take a more granular, technical approach to solving the problem. Nuzzo is a proponent of Bayesian alternatives, highlighting how Bayesian reasoning can address some limitations of p-values. However, Greenland et al. argue that, due to “philosophical objections” and “no conventions” being established, Bayesian methods fall short for widespread use.

Both Nuzzo (2014) and Greenland et al. (2016) illuminate how the treatment of p-values as a “gold standard” distorts scientific reasoning. They agree that this overreliance stems from deep-rooted misunderstandings and is reinforced by institutional pressures, dichotomous thinking, and poor statistical education. However, they differ in emphasis, where Nuzzo draws attention to how scientists behave under social and professional incentives, Greenland et al. provides a technical, model-based critique displaying the misunderstandings and recommends providing a detailed log of all the assumptions in research, including all the methods used so audiences can understand the p-value alongside all the context. Despite these minor differences, both authors converge on the key insight that p-values were never meant to serve as a standalone measure of research validity. Moving forward, both authors call for a shift in how we interpret evidence, which is to stray away from simplistic thresholds like p < 0.05 and toward more case-sensitive, transparent, and context-based reasoning. Science doesn’t advance with arbitrary cutoffs, and we need independent scientific judgment, honest communication of limitations, and reflection on the assumptions we build into our analyses (link to Week 1 Lecture).

Q3: (Nuzzo 2015) identifies cognitive biases that can mislead researchers. Which of these biases might encourage misuse of p-values?

Nuzzo (2015) identifies and explains several cognitive biases which directly encourage the misuse and misinterpretation of p-values, particularly by distorting how researchers interpret, analyze, and report data. She explains that these biases don’t stem from malintent or fraudulent scientific guidelines, but rather from the normal functioning of the human brain and human psychology, which has evolved to seek patterns and shortcuts, even when none exist. She claims that “Humans are remarkably good at self-deception” and highlights the idea of natural fallibility/cognitive biases wherein “If nothing seems wrong, it’s easier to miss it.” When paired with academic pressure, competitive incentives, and the complex statistical tools around p-value analysis, these biases become especially dangerous and must be understood in order to be mitigated. She claims that if we don’t identify these biases in research and simply jump to conclusions, ignore alternative explanations, and assume that information which looks reasonable is also accurate, we will “ceaselessly lead ourselves astray without realizing it.” The reproduction of research studies is what lets us believe that data-driven conclusions can stand the test of time and build on one another to stay current and widely applicable. However, studies cited in Nuzzo’s article show that reproduction in specific domains has failed severely (e.g. only 6 out of 53 landmark studies in healthcare could be replicated), pointing to bias as a well-established explanation for this phenomenon.

The first core bias Nuzzo notes is called hypothesis myopia, which refers to the tendency to fixate on a single favored explanation and seek only confirming evidence, while neglecting alternatives. Hypothesis myopia encourages the idea of ‘confirmation bias’ in statistical testing and analysis. This simply means that researchers may run analyses solely to find p-values that support their pre-existing hypothesis, and will ignore results or will not diligently consider those that don’t. Since p-values are designed to test how surprising data is assuming the null hypothesis and other assumptions, and not how true a preferred theory is when compared to another, this use distorts their meaning and takes away from independent study of a topic. An apt explanation is one shown in Nuzzo’s article by Baron, a psychologist from the University of Pennsylvania who says “People tend to ask questions that give ‘yes’ answers if their favoured hypothesis is true.” An experiment Baron uses as an example is where researchers test whether exposure to a messy room increases moral harshness to find out whether “disgust” is a contributing factor to this harshness. However, once they get statistically significant results for “disgust” as a factor they stop, failing to consider other emotions that caused this reaction such as sadness, anger or shame. The p-value supports a narrow interpretation that may ignore better explanations entirely and “researchers might be missing the real story entirely.”
Another important cognitive trap is the Texas sharpshooter fallacy, wherein researchers unintentionally find patterns in noisy data by testing many variables and only highlighting the ones that yield low p-values. This leads to p-hacking, which is the unknowing or conscious manipulation of data analysis (e.g. selectively including data that helps our claims, or rerunning tests in search of a specific result) to achieve statistical significance. The Texas sharpshooter fallacy helps explain how common p-hacking is, and Nuzzo notes that in a 2012 study of 2,000 psychologists, 50% selectively reported only studies that ‘worked’ in their favour, and close to 60% saw the results of their study first and then decided whether to collect more data or to stop. Most shockingly, “43% had decided to throw out data only after checking its impact on the p-value”. This is related to a phenomenon named by psychologist Kerr as “HARKing, or hypothesizing after results are known”, which is extremely dangerous and can lead to large numbers of spurious conclusions and inferences from research studies. As Nuzzo quotes psychologist Simonsohn, this bias is about “exploiting researcher degrees of freedom until p < 0.05.” An example is the chocolate-and-weight-loss study that Nuzzo highlights in her article: by running enough tests, researchers can find at least a few results that are statistically significant purely by chance. In this case, out of 18 measurements, three came out statistically significant by chance alone and stood out, which led to only those being reported. This cherry-picking makes researchers present random noise as meaningful discovery, which completely misrepresents what the p-value actually means.
Asymmetric attention, also called disconfirmation bias, is another bias which leads to researchers scrutinising unexpected or non-significant findings more rigorously (results that might increase p-value or reduce the statistical significance of their claims), while giving a “free pass” and letting through results that align with their expectations. This asymmetry inflates the number of falsely significant p-values, because only the favorable outcomes pass through unchecked. A 2011 review in Nuzzo’s article showed that “90% of the mistakes were in favour of the researchers’ expectations, making a non-significant finding significant.” This inflates false positives and undermines the objectivity of statistical inference. This bias explains that we double-check our data only when it doesn’t show what we hoped for, but as long as the data aligns with our expectations we don’t seem to question its limitations and the scrutiny of its shortcomings.
Lastly, the final bias, called just-so storytelling (and its variation JARKing, which means justifying after results are known), encourages researchers to build post-hoc narratives around significant results, even when those results are statistically weak or unexpected. This can lead to unjustified interpretations of borderline p-values (e.g. p = 0.099 being described as “approaching significance”) and undermines the meaning of the p-value as a pre-specified, objective metric. This bias leads researchers to reverse-engineer explanations for results that may be due to chance. It also causes them to overinterpret marginal p-values (e.g. p = 0.08), which blurs the threshold they had set for themselves and lets researcher-driven bias shape the reported results. This bias can mislead readers about the strength of the evidence, since the data are analysed first and the hypothesis is fitted to the data set only after checking which p-values are statistically significant, which is wrong both ethically and according to the scientific principles of conducting independent research.
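One of the survey findings above, that many psychologists looked at their results before deciding whether to collect more data, can be illustrated with a simulation. This is a rough sketch under assumptions of my own, not a reconstruction of any study in Nuzzo's article: the data are pure noise (the null hypothesis is true), the test is a z-test with a known standard deviation of 1, and the "peeking" researcher starts with 20 observations and adds 10 at a time up to 100, stopping as soon as p < 0.05.

```python
import random
from math import erfc, sqrt

def z_test_p(sample: list[float]) -> float:
    """Two-sided p-value for mean = 0, assuming a known standard deviation of 1."""
    z = sum(sample) / sqrt(len(sample))
    return erfc(abs(z) / sqrt(2))

def false_positive_rate(peek: bool, trials: int = 5_000, seed: int = 1) -> float:
    """Share of true-null experiments declared significant at the 0.05 level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(20)]
        if peek:
            # Test after every 10 extra observations; stop at the first p < 0.05.
            while z_test_p(sample) >= 0.05 and len(sample) < 100:
                sample += [rng.gauss(0, 1) for _ in range(10)]
        else:
            # Fixed design: collect all 100 observations, then test once.
            sample += [rng.gauss(0, 1) for _ in range(80)]
        hits += z_test_p(sample) < 0.05
    return hits / trials

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print("fixed sample size:", fixed_rate)
print("stop when significant:", peeking_rate)
```

In this setup the fixed design produces false positives at roughly the nominal 5% rate, while stopping whenever the result looks significant produces noticeably more, even though no single test was computed incorrectly.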

P-values are meant to help assess how compatible the data are with the null hypothesis, assuming everything else is done rigorously and without bias. But when these biases influence which hypotheses are tested, how data is analyzed, and how results are interpreted, the interpretation of those p-values becomes meaningless, since they no longer reflect statistical evidence but rather the researcher’s unconscious preferences. While credible, this article leans heavily toward one side of the debate and doesn’t seriously consider opposing views, such as those of defenders of p-values, or critiques of the newly recommended approaches that would replace them. Additionally, fields with large datasets (like genetic data studies) would face very different statistical pressures from these biases than hypothesis-driven psychology experiments, and this distinction isn’t explored in detail by Nuzzo. A final limitation of Nuzzo’s article is that she focuses only on the four biases she considers most important, overlooking other critical ones in data science such as the Bandwagon Effect, Anchoring Bias, and Survivorship Bias, which could also significantly distort p-value use and scientific inference.

Q4: Bohannon’s chocolate hoax (Bohannon 2015) highlights failures in study design and media reporting.

Identify two major red flags in the study (beyond just “a bad p-value”).
For each red flag, explain:
- Why is it a problem for scientific validity?
- What consequences it had for how the findings were interpreted by the public and media.
- How the issue could have been avoided or corrected through better study design, analysis, or reporting.

Bohannon’s chocolate hoax (Bohannon 2015) shows a case study in how weak study design and poor scientific communication can rapidly metastasize into global misinformation. This fabricated study, which claimed that eating chocolate daily accelerates weight loss became international headline news by some popular publications such as the Times of India despite being intentionally designed to fail just to test how widespread dissemination occurs without validation/verification of study results. Beyond just producing a meaningless p-value, Bohannon identifies two critical red flags that illuminate deeper structural issues in both scientific research and media ecosystems. The first is a dangerously small sample size with multiple ‘chance’ based correlations found purely caused by Texas sharpshooter fallacy, and the second is, the disregard of considering confounding variables and stating these limitations clearly, which leads to media publications posting the relationship between chocolate and weight-loss to sound ‘causal’ when it isn’t even merely correlational.

The first and perhaps most egregious red flag in the chocolate study is its combination of a tiny sample size (n = 15) and excessive data selectivity across 18 outcome variables. This is a perfect example to demonstrate an experimental design that sets up false positives and statistical illusion. Bohannon openly acknowledges this tactic, likening the study design to buying “lottery tickets,” where the more variables measured lead to a higher likelihood of hitting a “statistically significant” result purely by chance (Bohannon, 2015). Statistical significance (normally p < 0.05) loses all inferential value when many tests are run on a small sample with multiple variables hoping just a few show statistical significance. The study’s use of 18 outcome measures on 15 people yielded a 60% chance of generating at least one spurious p-value below 0.05 and this was the exact case in this experiment. The small sample size further compounds the problem, as random noise from biological variation (such as fluctuations in body weight from hydration or menstrual cycles) overwhelms any reliable signal and conclusion that can be drawn from the data. This is a major problem in data science since it doesn’t allow us to reproduce those same data results at a larger scale with a bigger sample size. Therefore, these results that were obtained would be hard to analyse in a larger sense to draw implications and most importantly it is highly randomly varied which makes the conclusions even weaker. With no pre-registration of hypotheses, this violates nearly all standards of responsible statistical inference and is highly dangerous since it can lead to misinformation. These design flaws were entirely invisible to the public. Most headlines proclaimed the study as proof that “chocolate accelerates weight loss” without reporting basic methodological facts, such as the number of participants or the lack of replication. 
Media outlets failed to question at all whether the results were clinically meaningful or generalizable. Without context, the p-values were interpreted as truth rather than a function of a flawed experimental design. This links back to previous readings explored in this 1st task about how the p-value interpretation usually gets diluted due to cognitive ease and difficulty to analyse it as one metric out of many, and not just the only metric for statistical significance. The public was misled into thinking chocolate had causally beneficial effects on metabolism, undermining both trust in nutrition science and personal health choices. When we take this experiment forward we need to think about what problems this misinformation can form? People can make dietary changes and decisions based on spurious conclusions drawn from this research and it can have real-life health implications for many if relied on blindly. This issue could have been avoided through basic principles of study design that align with how experiments such as this should be conducted. Examples of this in my understanding should include utilising an adequately large sample size (for example 30-50+ people per group), which are balanced for gender and age to draw more accurate explanations. Perhaps children need the sugar in chocolate for energy and growth which is burned faster in their bodies than for working adults that have sedentary lifestyles. Another way would be pre-registration of hypotheses and the data study/experimental design plans to prevent hoping for an outcome caused by luck or p-hacking. Lastly, correcting the multiple comparisons and focussing on specific variables, instead of looking for a specific answer or desired result just to get more media coverage is a more psychological and structural correction that is reinforced by media but should not be encouraged. 
Reducing false positives and being completely transparent in the entire process of the research, while showing the limitations and highlighting the limited applicability of the study clearly should’ve been at the front and centre so causal conclusions cannot be drawn. Publication standards like open data review would allow others to replicate findings more accurately and evaluate their robustness too since they allow transparency. As Greenland et al. (2016) emphasizes, without full transparency about the assumptions built into a study, p-values become epistemologically meaningless even if statistically significant.
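The 60% figure discussed above can be reproduced with a short calculation. The sketch below assumes the 18 outcome measures behave like independent tests of true null hypotheses, which is a simplification (real outcomes such as weight and cholesterol are correlated). It also shows the per-test threshold a Bonferroni correction would impose, as one simple example of "correcting the multiple comparisons"; the simulation settings (`trials`, `seed`) are arbitrary choices of mine.

```python
import random

def familywise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha across independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests

def simulate_familywise(n_tests: int, alpha: float = 0.05,
                        trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check: under a true null, each p-value is Uniform(0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(trials)
    )
    return hits / trials

analytic = familywise_error(18)      # 1 - 0.95**18
simulated = simulate_familywise(18)  # should land close to the analytic value
bonferroni_alpha = 0.05 / 18         # per-test cutoff that caps the overall rate at 0.05

print("analytic:", round(analytic, 3))
print("simulated:", round(simulated, 3))
print("Bonferroni per-test threshold:", round(bonferroni_alpha, 4))
```

Under the independence assumption, measuring 18 outcomes gives roughly a 60% chance of at least one "significant" result even when nothing real is going on, whereas testing each outcome against the stricter Bonferroni threshold keeps that chance at or below 5%.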

Another deeply embedded flaw in the chocolate study is its complete disregard for controlling or even acknowledging confounding variables, or variables that are often seen as uncontrollable factors. What are factors other than the chocolate that could influence outcomes such as weight loss or cholesterol levels? Confounding variables are a core threat to internal validity in any study, particularly in nutrition science where individual variation such as hormonal cycles, hydration, exercise, stress and sleep are deeply core variables that aren’t considered strongly enough here. Bohannon openly admits the team made no effort to balance age, sex, or lifestyle across groups, nor to monitor the actual diet or calorie intake of participants during the 3-week study. For example, as he notes, a woman’s body weight can naturally fluctuate by up to 5 pounds due to her menstrual cycle, more than the reported effect size in the study which is a shocking red-flag to me. Without standardizing for such variation or tracking participants’ behavior beyond the assigned diet, there’s no way to attribute observed changes to the chocolate intervention or variable. This flaw renders the supposed “causal” link between chocolate and weight loss scientifically meaningless. Since the study reported “better cholesterol” and “improved well-being” alongside weight loss, these multidimensional claims fed into confirmation bias narratives for both journalists and readers to disseminate further, without any scientific basis or validation/review. The media reported these secondary outcomes as further proof that chocolate is beneficial, unaware that the results could easily be artifacts of dietary variance, or measurement noise. To address this design flaw, the study should have collected detailed baseline and behavioral data (like calorie intake, exercise, menstrual cycle tracking). 
Another method could’ve been to consider longer-term follow-up to observe whether changes persisted beyond short-term noise (make it a longer-term observational study.) These would have ensured that any observed differences could be more confidently attributed to the intervention rather than uncontrolled variation.

In conclusion, while my response focused on two core red flags, it’s also important to acknowledge that other equally significant issues surround the study. One of the biggest limitations of my analysis is that I didn’t explore how publication incentives, journal reputation, or statistical illiteracy in journalism contributed to the hoax’s success. Bohannon makes it clear that the media latched onto the story without checking even the most basic facts, and that reflects a much wider problem in data science which is that we often accept statistical significance (p < 0.05) as a shortcut for truth without understanding its context. This is how causal-sounding headlines go viral from findings that aren’t even properly correlational. The public rarely sees the flaws beneath the surface such as poor design of the experiment, the data set being used and a more holistic understanding of the problem. This hoax forces us to rethink the entire ecosystem that allows flawed science to appear credible and the psychological side of why scientists just focus on p-values, which is that they get rewarded if they find statistical evidence and draw more eyes to their research. It reinforces that we need better standards of transparency, responsible reporting, and statistical education for both scientists but also more so if not equally the journalists. The takeaway here is that without safeguards and critical thinking from both of these stakeholders (scientists and journalists), bad scientific design can look good enough to persuade large audiences in misleading directions.

Q5: Imagine you are a journal editor: What one policy change would you implement to reduce both statistical misuse (Greenland et al. 2016) and self-deception/hype (Nuzzo 2015; Bohannon 2015) in published research?

If I could implement one policy change as a journal editor, it would be to require registered reports for all empirical submissions, meaning studies must be accepted first based on their design and research question before data is collected. This one shift directly targets both statistical misuse and self-deception, integrating key concerns from Greenland et al. (2016), Nuzzo (2015), and Bohannon (2015) while offering structural reinforcement against a wide range of biases.

To arrive at the choice of registered reports, I carefully considered several other promising ideas but found each lacking in scope or feasibility for adoption at scale. One strong consideration was requiring open data and full transparency of analytical methodology, which both Nuzzo and Greenland et al. (2016) support as a way to improve accountability and reproducibility. This approach allows other researchers to verify results, reanalyze data, and catch errors or misrepresentations. However, it is ultimately a reactive safeguard. It is useful only after publication, and it assumes that enough qualified people will actually take the time to audit the work, which is rarely the case. It would need strong guidelines and reward mechanisms that encourage researchers to act as auditors instead of pursuing their own original research that they have ownership over.

Another option was blind data analysis, which Nuzzo highlights through examples from physics when she discusses the risk of confirmation bias. Here researchers analyze data without knowing which condition or group it belongs to, thereby reducing unconscious confirmation bias and the temptation to favor certain outcomes. If you don’t know which group is which, or what result you are looking for, you can’t favour information from one group and discard information from others to fit your narrative. The idea is similar in spirit to double-blind experimental designs, where neither researchers nor participants know who is in which group. But the method is logistically demanding, requires specialized infrastructure, and often doesn’t translate well to more qualitative fields like the social sciences, where data are more complex and harder to anonymize meaningfully. For example, in a qualitative study on trauma survivors, it would be nearly impossible to blind researchers to which participants experienced which events, since the data often comes from personal interviews and context-driven narratives that can’t be anonymized without losing meaning. Blinding such a study would make conclusions very difficult to draw, and there is always the risk of unblinding, where researchers or participants figure out which group they are part of (placebo or experimental) and then bias themselves. Another example is medicine, where some drugs require specific procedures or delivery methods that are difficult to mask or replicate for the participants in the placebo condition. For instance, a drug that needs to be administered intravenously or has a unique taste might be hard to conceal, and once participants or researchers know which treatment is being given, the blinding no longer protects against bias.

I also considered adversarial collaboration, where rival researchers with opposing hypotheses work together to design and interpret a study. This is a powerful safeguard against hypothesis myopia and asymmetric attention, but in practice it is very difficult to scale, relies on personal cooperation, and would be nearly impossible to require across all submissions, since many researchers have specific niches or approaches that they are known for. If they collaborate with opposing researchers and are consistently proven wrong, they might damage their reputation in the field, since their approach is constantly being challenged within the same study. Each of these methods addresses important dimensions of the problem, but they depend on post hoc correction, high resource availability, or voluntary cooperation, and none of them offers the structural, scalable shift that registered reports, with their up-front review of the planned data analysis and methodology, provide.

Registered reports, by contrast, preemptively neutralize several of the core issues discussed across all the readings. Greenland et al. (2016) outline 25 misinterpretations of p-values, but many stem from p-values being treated as the final arbiter of truth after-the-fact. Registered reports disrupt this by removing the incentive to chase a specific p-value. Since journal acceptance would be granted based on the rigor of the method in this solution, and not just the result, there’s no advantage to p-hacking, cherry-picking, or “hypothesizing after results are known” (HARKing), which Nuzzo (2015) and Bohannon (2015) both show to be widespread and deeply misleading practices. This solution also mitigates the Texas sharpshooter fallacy and “just-so storytelling” effects which Nuzzo describes as cognitive traps that encourage researchers to impose their narratives on random noisy data. With a pre-registered plan, researchers cannot retroactively fit their story to the data, which prevents both the overinterpretation of noisy data and the media exaggeration of invalid data shown as ‘causal’ significance. Bohannon’s hoax exploited precisely this vulnerability even further with their small dataset and many variables with no pre-specified hypothesis before the data-collection. Had the chocolate study been evaluated as a registered report beforehand, it would have never passed the initial design review due to its tiny sample size, lack of analysis in specific variables instead of guessing and getting lucky by chance, alongside the absence of controls for confounding variables. Even the incentive to submit such a study would have diminished since researchers wouldn’t waste their time and resources into these practices that are unethical just to publish research which would fail the review stage anyway.
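To make the multiple-testing risk behind this argument concrete, the short Python sketch below computes the chance of at least one false positive when several outcomes are each tested at the 5% level. It assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function name and the example numbers of outcomes are my own illustrative choices, not figures taken from the readings.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha across independent tests of true nulls."""
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them avoid it together with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


# Hypothetical numbers of measured outcomes in a study with no pre-specified hypothesis.
for outcomes in (1, 5, 10, 20):
    rate = familywise_error_rate(outcomes)
    print(f"{outcomes:>2} outcomes tested: {rate:.0%} chance of a spurious 'finding'")
```

Under these assumptions the risk grows quickly with the number of outcomes, which is why a pre-registered plan that fixes the hypothesis and outcome in advance removes much of the room for getting lucky by chance.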

Naturally however, registered reports are not without drawbacks. One key limitation is that they require researchers to submit their study design, hypotheses, and analysis plan for peer review before data collection begins, which can significantly slow down the publication process, something that may deter researchers under pressure to publish quickly. This solution also assumes that research questions and hypotheses are clearly defined in advance, which may not align with the nature of some specific observational research, where insights can often emerge from open-ended questions rather than set hypotheses to begin with. Additionally, not all academic fields have the infrastructure/editorial expertise needed to critically evaluate study designs in the absence of results, which may make this model harder to implement in smaller, less-resourced disciplines. Still, despite these challenges, the strength of pre-registered reports and hypotheses, alongside study design lies in how they fundamentally shift the incentives. Rather than rewarding surprising or “significant” results, they prioritize rigorous methodology and transparency, helping reduce both statistical misuse and the temptation to change/selectively draw out findings for impact.

Essentially, while there are many good solutions proposed across the readings, most act downstream, meaning they simply catch problems after they occur. Requiring registered reports acts upstream, changing the incentive before data are even collected. It forces researchers to justify their methods and frameworks, and not their p-values, which is where ethical scientific analysis should begin. For a journal editor tasked with reducing both misuse and self-deception, this idea aligns most directly with the deeper epistemic reforms that all four texts argue are long overdue.

Task 2: Modelling a complex world

Q1: Which independent variables might you include in your regression, and what’s your rationale for including/excluding them?

Our goal in this task is to select independent variables that offer strong explanatory power and are easy to interpret, while avoiding redundancy and multicollinearity problems. A useful independent variable in this context should meet a few criteria. It should vary meaningfully across countries and have a plausible causal relationship to economic productivity in that nation. Additionally, it should not be a component of the equation for GDP per capita itself, in order to avoid reverse causality or double-counting the same quantity just to show strength/statistical significance.

Some key independent variables I would want to include are average years of schooling (education indicator), life expectancy (health indicator), trade openness (exports plus imports as a % of GDP, economic indicator), rule of law/corruption index scores (institutional indicators, since legal certainty supports confidence in economic activity) and energy consumption per capita. To begin with education, it is one of the most important long-term drivers of economic productivity: a more educated population means a higher share of skilled workers, and a broader set of skills and specialisations that can be monetised in labour markets raises workers’ earning capacity. Simply put, countries with higher levels of education (e.g. Canada, Japan, and Ireland) produce a more skilled labor force (in finance, engineering, medicine and technology, for example), leading to more innovation (such as AI), which in turn increases labour productivity and allows higher wages on average, ceteris paribus. Countries like South Korea and Finland have invested heavily in education and now enjoy high GDP per capita, since education reflects not only how knowledgeable people are on average but also the value of a country’s human capital. However, it is important to be cautious of reverse causality: richer countries also tend to spend more on education, so the relationship may be bidirectional and the estimated effect of education biased. Including education would therefore require caution, and potentially controlling for initial income levels, to avoid misleading inferences.

Another key variable, though less intuitive at first glance, is life expectancy (along with similar health indicators), because population health influences labor productivity as well. A workforce burdened by poor health has higher absenteeism, lower cognitive performance, and reduced longevity in employment. For example, Singapore, with one of the highest life expectancies in the world, maintains one of the highest GDPs per capita in part due to the longevity and productivity of its healthy population. Conversely, countries like Lesotho or Chad, with significantly lower life expectancies, often face constrained economic growth because poor health and shorter working lives limit how much people can invest in earning an income. Low life expectancy typically reflects broader failures in healthcare systems, nutrition, or sanitation infrastructure. Including this indicator also indirectly accounts for public health investment and institutional capacity, although this relationship is bidirectional too, since richer countries can invest more in public healthcare. We must also be careful not to assume that all populations with high life expectancy are equally economically productive, especially in countries with high elderly dependency ratios such as Japan, where I predict GDP per capita might decline since a large proportion of the population is retired and not actively contributing to economic output. This puts pressure on the working-age population and can reduce overall labor productivity and growth potential.

Another factor I think is important is trade openness, since it indicates a country’s integration with global markets. In a world that is more globalised than ever before, with countries collaborating across global supply chains on technology and trade, this factor lets us account for the government’s impact on GDP per capita, with trade openness acting as a proxy for trade policy. Good examples are Hong Kong and the UAE, which rely heavily on trade and have among the highest GDPs per capita globally. Their openness facilitates technology and knowledge transfer, competitive pressure, and specialization in high-value sectors. For example, one of Hong Kong’s highest revenue-generating sectors is logistics and trade, which speaks directly to its open trade policies and its location. At the other end of the spectrum is North Korea, one of the least open economies, which has an extremely low GDP per capita. However, it would be inaccurate to say that North Korea’s low GDP per capita is due solely to its lack of trade openness, because this is only one part of the complete context; other core reasons could be centralised government control of the economy or corruption. Moreover, some countries with high trade-to-GDP ratios, like Liberia or Sudan, show weaker economic outcomes because of commodity dependence and a lack of diversification. Therefore, trade openness is best interpreted alongside the surrounding context, such as the quality and composition of exports and imports, so that we understand whether the openness is truly beneficial.

A couple of variables I would not include would be population size and variables that are outcomes, rather than causes of high GDP per capita such as car ownership or internet penetration rates. The reason for excluding population size is quite straightforward since GDP per capita is calculated by dividing total GDP by population, the influence of population is already embedded in the dependent variable. Including it again as an independent variable would introduce redundancy and risk distorting the model. More importantly, the relationship between population size and per capita income is not linear or consistent across countries. For instance, China and India have enormous populations but relatively low GDP per capita, while Luxembourg and Singapore have small populations yet very high GDP per capita. Including population would not help explain variation in income levels and could instead create misleading associations driven by outliers. Additionally, population size does not tell us anything about the age structure, productivity, or education level of the population, all of which matter far more for economic output per person. Additionally, variables such as car ownership or internet penetration, while intuitively correlated with income, are more likely to be symptoms of economic growth rather than its causes. For instance, Qatar has extremely high car ownership, but that stems from pre-existing wealth rather than a driver of productivity. Including these variables risks mistaking the effects of income for its determinants. Furthermore, such variables may be affected by cultural or geographic factors (e.g., urban planning or public transport usage) that confound their interpretability.
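One way to act on the concern about redundancy and multicollinearity is to screen candidate predictors for very high pairwise correlation before fitting the model. The sketch below is a minimal illustration in plain Python: the function names, the 0.8 threshold, and the six rows of numbers are all assumptions made up for this example, not real country data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def flag_collinear(predictors, threshold=0.8):
    """Return predictor pairs whose absolute correlation reaches the threshold."""
    names = list(predictors)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(predictors[first], predictors[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 3)))
    return flagged


# Invented values for six hypothetical countries (illustration only).
candidates = {
    "schooling": [4, 6, 8, 10, 12, 13],            # average years of schooling
    "life_expectancy": [55, 61, 68, 72, 79, 82],   # years
    "trade_openness": [60, 35, 90, 40, 120, 55],   # trade as % of GDP
}
print(flag_collinear(candidates))
```

A flagged pair does not automatically mean one variable must be dropped, but it signals that their separate coefficients will be hard to interpret. A fuller check would use variance inflation factors, which this sketch does not compute.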

Q2: Regression rests on several assumptions (linearity, independence, etc.). Which assumption is most likely to be violated in cross-country data, and why?

When running regressions on cross-country data to explain differences in GDP per capita, the assumption that is most likely to be violated is the independence of observations. This assumption means that each country’s data point is assumed to be unrelated to the others, specifically that the error terms in the regression model are uncorrelated across countries. In other words, what happens in one country should not systematically influence what happens in another. However, in a globalized world, this assumption rarely holds true. Countries are highly interconnected through trade, migration, investment flows, regional politics, and even shared exposure to climate risks or financial crises. For example, if the United States enters a financial crisis, countries that are closely tied to it through trade and capital flows such as Mexico or Canada may also experience economic downturns. These downturns are not caused by domestic mismanagement in those countries, but rather by spillover effects that create similar shocks across national borders.

This lack of independence can show up as what economists call spatial/cross-sectional dependence, which just means that neighboring or economically linked countries can exhibit patterns in their data that are more similar to each other than to countries elsewhere. Regional organizations like the ASEAN countries often harmonize economic policy, infrastructure investment, and trade laws, which can cause GDP per capita outcomes to cluster within regions.

Even shared historical legacies such as colonial rule or membership in former trade blocs like the Soviet Union can lead to structural similarities that violate the independence assumption. What I’ve learned from my home university is that when we have strong co-dependence between countries and when the assumption of independence is violated, the standard errors that come out of the regression are no longer reliable. That means statistical tests we use to determine whether relationships are meaningful can become misleading. We might find a relationship where none exists or miss one that does because the underlying data model assumes each country behaves in isolation when in fact they do not.
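The claim that standard errors become unreliable can be illustrated with a small simulation. The sketch below is a simplified stand-in for a full regression: it only estimates a mean (an intercept-only model), and it assumes that countries in the same region share a common random shock. All names and parameter values are hypothetical choices for illustration.

```python
import math
import random


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def compare_standard_errors(num_regions=5, countries_per_region=10,
                            shock_sd=1.0, noise_sd=1.0, runs=1000, seed=0):
    """Return (actual spread of the sample mean, average naive standard error)."""
    rng = random.Random(seed)
    n = num_regions * countries_per_region
    sample_means, naive_ses = [], []
    for _ in range(runs):
        sample = []
        for _ in range(num_regions):
            regional_shock = rng.gauss(0.0, shock_sd)  # shared by the whole region
            for _ in range(countries_per_region):
                sample.append(regional_shock + rng.gauss(0.0, noise_sd))
        sample_means.append(sum(sample) / n)
        # The naive formula treats all n countries as independent observations.
        naive_ses.append(sample_sd(sample) / math.sqrt(n))
    return sample_sd(sample_means), sum(naive_ses) / runs


actual, naive = compare_standard_errors()
print(f"actual spread of the estimate: {actual:.3f}")
print(f"average naive standard error:  {naive:.3f}")
```

With shared regional shocks the naive standard error comes out well below the actual spread of the estimate, so tests based on it would look more certain than they should. Setting `shock_sd=0.0` removes the dependence, and the two numbers should then roughly agree.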

Although independence is the most vulnerable assumption in cross-country analysis, it is not the only one that deserves attention. The assumption of linearity, which says that the effect of a variable like education or health on GDP per capita is constant across all levels, can easily be unrealistic. For example, increasing years of schooling from two to six in a low-income country may dramatically raise productivity, while increasing it from fourteen to eighteen in a high-income country might have a much smaller effect. Another common issue is heteroskedasticity, which refers to unequal variability of the error terms across observations. Poorer countries often have more volatile growth patterns and less stable institutions, which may lead to greater noise in the data. This makes the assumption of equal variance in errors less likely to hold. There is also the risk of omitted variable bias, where important variables that are hard to quantify, such as country-specific geographical factors, less visible political instability, or cultural norms, are left out of the model. If those variables are correlated with both the included independent variables and GDP per capita, they can distort the results of the regression.

Despite these challenges, the violation of independence is the most systematic and deeply rooted issue when working with cross-country data. Recognizing and addressing this assumption explicitly helps improve the credibility of our analysis and prevents us from drawing overly confident or inaccurate conclusions about what drives economic prosperity across the world.

Q3: Suppose your model finds a statistically significant relationship between education and GDP per capita. Policymakers claim this “proves” education causes economic growth. How would you challenge or qualify that claim?

A statistically significant result only means that we have observed a consistent pattern in the data that is unlikely to be due to chance. It does not tell us whether that pattern reflects a cause-and-effect relationship. In technical terms, correlation is not the same as causation. Just because education levels are higher in countries with higher GDP per capita does not automatically mean that increasing education will directly raise income levels. There are several reasons why this conclusion needs to be challenged. First, the model may suffer from reverse causality. This means that while we may think education leads to higher GDP per capita, it is also very possible that richer countries can afford to invest more in their education systems. In this case, it is economic growth that causes more education, not the other way around. For example, Scandinavian countries like Norway or Sweden spend heavily on education and have high GDP per capita, but they also have a long history of stable economic development, which allowed them to invest in their education systems over time. So the observed relationship could be flowing in the opposite direction.

Next, there is the issue of omitted variable bias. This happens when the model leaves out important factors that influence both education and economic growth. For example, countries with strong government and legal institutions, low corruption with checks and balances in government, or effective governance programmes by local mayors may perform better both in terms of educational quality and economic development. If we do not include these institutional outside extrinsic variables in the model, we risk attributing the effect of these missing factors completely to education. As a result, the estimated impact of education may be overstated or even completely misleading. Additionally, education itself is not a single, uniform factor since not all education systems are equally effective and balanced as a constant variable we can apply just by looking at the number of years. Simply increasing the number of years people spend in school does not guarantee economic productivity. The quality of education, the relevance of the skills taught, and how well education matches the needs of the labor market all play a role in education as a factor. For example, some countries have very high school enrollment but poor learning outcomes, which limits the potential economic benefits (like India and Bangladesh). In contrast, countries like Singapore or South Korea have aligned their education systems with their development goals, focusing on skills and innovation, which better supports their economic growth and focuses on real alignment to bigger goals. GDP per capita does not fully capture all aspects of economic growth and education is often measured by average years of schooling, which may not reflect the true skill level of the workforce. So even if the relationship is statistically strong, the variables may not be capturing the deeper dynamics that actually drive growth in the economy. 

Rather than stating flatly that education causes economic growth, we should say that education is associated with higher GDP per capita and may be part of a broader set of conditions that support development. This more careful approach avoids oversimplification and leads to better policy decisions based on a fuller understanding of how complex development actually works.

I would say that education is a necessary but not sufficient condition for growth. It contributes meaningfully, especially when it is of high quality, well-targeted, and supported by strong institutions, but it does not act alone or guarantee growth in every case. For example, if a country increases education spending but fails to address corruption, lacks job opportunities, or provides education that is misaligned with labor market needs, then those additional years of schooling may not translate into higher productivity or income. Rather than rejecting the policymakers’ claim completely, I would qualify it by saying that education is clearly an important part of long-term development, but its impact depends on many other factors. It matters how the education system is set up, how strong the country’s institutions are, and whether the economy can actually provide good jobs for educated people. If those things are missing, then just increasing education on its own may not lead to real growth. So while education helps, it works best when it is part of a wider and well-supported development plan.

Q4: Instead of only regression, how could exploratory visualizations (scatterplots, grouping, comparisons, etc.) give complementary insights into the same question?

To begin, I’d love to discuss what exploratory visualizations really are. In my view, exploratory visualizations are deeply valuable tools that can offer insights beyond what a regression model alone can provide when analyzing the relationship between education and GDP per capita. This is because while regression gives us a numerical summary of the relationship such as a coefficient that tells us the direction and slope of the effect as well as whether it is statistically significant, visualizations on the other hand, help us see patterns, trends, clusters and outliers in a much more intuitive and accessible way. This is important because real-world data is rarely clean and simplified so visuals help us engage with the data more critically before jumping to conclusions. Seeing the information and data laid out in a clearly formatted visualisation allows us to have this ‘big picture’ view on what is happening in the data. Visualisations help us spot patterns, irregularities, and structures that might remain hidden behind coefficients and p-values.

For example, a scatterplot that plots countries’ average years of schooling on the x-axis and GDP per capita on the y-axis can immediately show us whether the relationship looks linear or not. We might notice that for low-income countries, small increases in education lead to sharp increases in GDP per capita, but for wealthier countries the curve flattens out which would suggest diminishing returns to education. This kind of non-linear pattern might be missed or misrepresented in a basic linear regression unless we model it explicitly with something like a quadratic term. Log transformations of GDP per capita are also helpful, especially when comparing countries at different stages of development. They reveal non-linear effects, such as the idea that one extra year of schooling in a low-income country might produce a much larger economic return than the same increase in a high-income country. Visualizations can also help us spot outliers such as countries that have high education levels but relatively low GDP per capita or vice versa. These outliers can be crucial in helping us think more critically about what else might be going on such as issues with governance, conflict or natural resources that distort the pattern. Group comparisons also add another layer of insight. For instance, we might use labelled keys and labelled groups to compare specific regions like North Africa (Algeria/Morocco/Tunisia) vs. Western Europe (Germany/France/UK/Switzerland) and quickly observe that the relationship between education and income looks very different across these groups. Even if the overall regression shows a positive relationship, certain regions might show weaker or stronger correlations. This would add some colour about the possible interactions we wouldn’t be able to tell otherwise, where the impact of education on GDP per capita depends on other factors such as political institutions or economic openness.
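A numerical companion to eyeballing the scatterplot is to check whether a straight line fits raw GDP per capita or log GDP per capita better. The sketch below fits a simple least-squares line in plain Python and compares R² on the two scales. The data are synthetic and deliberately constructed to grow by a constant proportion per year of schooling, so they say nothing about real countries; the function name and parameters are assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope, intercept and R^2 for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - sse / sst


# Synthetic example: GDP per capita rises by a fixed proportion per extra year.
schooling = list(range(2, 15))
gdp = [500 * math.exp(0.25 * years) for years in schooling]

_, _, r2_raw = fit_line(schooling, gdp)
log_slope, _, r2_log = fit_line(schooling, [math.log(g) for g in gdp])

print(f"R^2 on raw GDP: {r2_raw:.3f}")
print(f"R^2 on log GDP: {r2_log:.3f}")
print(f"log-scale slope: {log_slope:.2f} (about {math.exp(log_slope) - 1:.0%} per extra year)")
```

On the log scale the slope is read as an approximate proportional change in GDP per capita per extra year of schooling, so the choice of scale changes what "the same effect" means across poor and rich countries. With real data neither fit would be perfect, and a plot should still be inspected alongside these numbers.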

Furthermore, we could use boxplots or grouped bar charts to compare the distribution of GDP per capita for countries with low, medium and high levels of education. This can show whether there is large overlap between these categories or whether the differences are truly meaningful. Therefore visualisations are key in this process since they help us explore the data, develop hypotheses, check for patterns that we might want to model more formally and also communicate findings in a way that is accessible and honest about the data’s complexity. While regression models are powerful for quantifying relationships, they can also be misleading if used blindly. Visualizations bring us closer to the raw evidence and can often reveal nuances that summary statistics hide. In this way, combining both approaches leads to more robust and thoughtful analysis.
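The numbers a boxplot draws can also be computed directly, which makes the "is there large overlap?" question explicit. The sketch below uses Python's standard `statistics` module; the grouping into low, medium and high education and the GDP per capita values (in thousands of USD) are invented for illustration, and the helper names are my own.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def boxes_overlap(group_a, group_b):
    """True if the interquartile ranges (the 'boxes') of two groups overlap."""
    _, a_q1, _, a_q3, _ = five_number_summary(group_a)
    _, b_q1, _, b_q3, _ = five_number_summary(group_b)
    return a_q1 <= b_q3 and b_q1 <= a_q3


# Invented GDP per capita values (thousand USD) by education tier.
groups = {
    "low": [1, 2, 3, 7, 9],
    "medium": [3, 6, 10, 14, 20],
    "high": [18, 30, 42, 55, 70],
}

for name, values in groups.items():
    print(name, five_number_summary(values))
print("low vs medium boxes overlap:", boxes_overlap(groups["low"], groups["medium"]))
print("medium vs high boxes overlap:", boxes_overlap(groups["medium"], groups["high"]))
```

Overlapping boxes suggest that the education tier alone does not cleanly separate countries by income, which is a prompt to look for other explanatory factors rather than a formal test.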

Task 3: beyond the average

Q1: Why is it problematic to only focus on the “average case” in social science data?

Focusing only on the “average case” in social science data is problematic because it hides important variation and nuance in how different groups or individuals experience social/economic phenomena since it excludes data away from the average. Koenker and Hallock (2001), in their paper on quantile regression, show that ordinary least squares (OLS) regression, which is the most common statistical tool used to model relationships between variables, only tells us how an independent variable (like education) affects the mean, or average, of a dependent variable (like income). But in reality, policy impacts and social outcomes are rarely uniform across all people and therefore using the “average case” doesn’t allow us to fully understand the problem, and if we can’t even understand the roots and extremes of a problem, as policymakers we can never be able to then solve them. To understand why this matters, we first need to explain what OLS does. OLS is a method that draws a straight line through a scatterplot of data points in such a way that it minimizes the total squared distance between the line and each point. This line represents the average relationship between variables. For example, OLS might tell us that each extra year of schooling increases a person’s income by 5,000 GBP on average. But this average hides something crucial, which is that the increase may be very different depending on whether someone is at the bottom (under the poverty line) or the top (centi-millionaire/billionaire status) of the income distribution. For example, Koenker and Hallock show in their application to wage data that the return to education is much higher at the top of the income distribution than at the bottom. This tells us that the economic benefit of an additional year of schooling is not the same for everyone.

Ignoring these differences can lead to misleading or even harmful policy conclusions. If we only look at the average effect, we might assume that investing in education helps everyone equally, when in reality it may disproportionately benefit those who are already advantaged. This is especially important in social science, where issues of inequality, marginalization, and access are central. Quantile regression according to these researchers lets us ask more in-depth and precise questions for social sciences, which is that we want to know who gains the most vs. who gains the least and why? In conclusion, Koenker and Hallock argue that by relying solely on average effects through OLS, researchers risk overlooking crucial variation and drawing overly simplistic conclusions.

Q2: Give a real-world example where focusing on the extremes (e.g., the poorest, most at-risk, or highest-performing) reveals insights that the average hides.

A powerful real-world example of why we must focus on the extremes rather than just the average can be found in the detailed birthweight analysis presented by Koenker and Hallock (2001). In their paper, they examine how factors like maternal education, smoking, age, prenatal care and marital status affect the birthweight of infants. If we were to rely only on traditional methods like OLS regression, which estimates only the average effect of each of these factors, we would miss something vital that has real-world implications and consequences. For instance, the average difference in birthweight between babies of mothers who smoked and those who didn’t may seem modest, but quantile regression shows that this effect is far more severe at the lowest end of the birthweight distribution. This lower tail is where babies are most vulnerable to health risks. In fact, they find that the effect of maternal smoking is significantly more damaging among the lightest babies, which is exactly the group most at risk of medical complications and long-term developmental problems. This insight would be completely invisible if we looked only at the average baby. By shedding light on the tail ends of the distribution, especially the lower tail where infants are under 2500 grams, this analysis gives a much more precise picture of where policy interventions should be focused.
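The difference between an average effect and a tail effect can be sketched with a deliberately simple comparison of two groups. In the simplest case of a single yes/no predictor, a quantile-regression coefficient is essentially the gap between the two groups at that quantile, so the code below compares gaps directly. The birthweights are synthetic numbers that I constructed so that the gap is widest in the lower tail; they are not Koenker and Hallock's data, and their actual model includes many more covariates.

```python
def percentile(values, q):
    """Quantile (0 <= q <= 1) with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def group_gaps(group_a, group_b, quantiles=(0.1, 0.5, 0.9)):
    """Gap in means and gaps at selected quantiles (group_a minus group_b)."""
    mean_gap = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    quantile_gaps = {q: percentile(group_a, q) - percentile(group_b, q)
                     for q in quantiles}
    return mean_gap, quantile_gaps


# Synthetic birthweights in grams (illustration only): the assumed penalty
# shrinks from 400 g for the lightest babies to 100 g for the heaviest.
non_smokers = [2500 + 20 * i for i in range(101)]
smokers = [weight - (400 - 3 * i) for i, weight in enumerate(non_smokers)]

mean_gap, quantile_gaps = group_gaps(non_smokers, smokers)
print(f"gap in means: {mean_gap:.0f} g")
for q, gap in quantile_gaps.items():
    print(f"gap at the {q:.0%} quantile: {gap:.0f} g")
```

In this constructed example the single average gap understates the difference among the lightest babies and overstates it among the heaviest, which is the pattern the average-only view hides.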

To link this with a real-world example that I came up with, consider public education spending and student performance in under-resourced schools. I thought about this when watching the movie “Coach Carter” and seeing how under-funded schools tend to have the worst test scores, the highest dropout rates and poor literacy rates. If policymakers only look at the average test scores of students across the country, such as national averages for SAT scores or A-level results, they might conclude that education funding levels are acceptable. However, this would hide what is happening at the bottom, in the lowest decile of test performance. Just like with birthweight, these students are disproportionately affected by underfunded schools. At the average level, increases in funding might seem to have small effects on test scores, but if we use quantile regression to isolate the impact on the bottom ten percent of students, we might find much stronger effects. For example, it could reveal that an extra dollar spent on schooling has a much larger effect on test scores at the bottom ten percent than at the median. This insight is crucial because it tells us that average-based policies can be completely ineffective for those who need the most help. In both the birthweight case and the education example, the average masks the truth. If we only look at what works for the “average” case, we risk reinforcing systems that fail the most vulnerable. Koenker and Hallock (2001) show us that the extremes often carry the most important information, and ignoring them can lead to harmful or under-researched conclusions, especially in the social sciences.

Q3: Compare how might linear regression vs. quantile regression tell different stories about the same dataset?

When analyzing a dataset, linear regression and quantile regression can both offer valuable insights, but they tell fundamentally different stories. Linear regression estimates the average relationship between the dependent variable and one or more independent variables. This is done by minimizing the sum of squared vertical distances between the actual data points and the fitted line, which we call the line of best fit; this method is known as ordinary least squares (OLS). It gives us one summary effect for the entire dataset, expressed as a single constant slope, which is great for understanding how the data moves on average. For example, if we were looking at the relationship between education and income, linear regression would give us one single estimate showing how much additional income is associated, on average, with each extra year of schooling. While this is useful for understanding general patterns, it can be misleading if the effect of the variable is not the same across all parts of the population. In many real-world cases, the effect of a variable like education is not the same for someone at the bottom of the income distribution as it is for someone at the top.

Quantile regression, on the other hand, gives us a more detailed and layered picture. Instead of looking only at the average, it estimates the relationship at different points in the outcome distribution. This could be the 25th percentile, the median, or the 75th percentile, but also more extreme points such as the 10th or 90th percentile. In simpler terms, quantile regression asks how the independent variable affects people at the bottom, the middle, and the top separately. It does not assume that one effect fits all, and this matters especially in datasets with inequality or skewed distributions. For example, we can ask how education affects low-income individuals compared to middle-income or high-income individuals. Most importantly, this method does not assume uniformity across the distribution, which makes it more flexible and often more honest about what is really going on.
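To make the contrast concrete, here is a minimal sketch in Python using only the standard library. The data are synthetic and invented for illustration (they are not from Koenker and Hallock), and the function names `ols_line` and `quantile_line` are my own. The quantile fit uses a brute-force search that relies on the fact that a line minimizing the check (pinball) loss can always be chosen to pass through two data points; a real analysis would normally use a dedicated library instead.

```python
from itertools import combinations


def pinball_loss(residual, tau):
    """Check loss: under-predictions are weighted by tau, over-predictions by 1 - tau."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def ols_line(points):
    """Closed-form least-squares slope and intercept for one predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantile_line(points, tau):
    """Brute-force quantile regression: try every line through two data points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line is not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(pinball_loss(y - (intercept + slope * x), tau) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, slope, intercept)
    return best[1], best[2]


# Synthetic data (an assumption for illustration): the spread of y grows with x,
# so the effect of x differs between the bottom and the top of the distribution.
points = [(float(x), 2.0 * x + x * e)
          for x in range(1, 11)
          for e in (-1.0, -0.5, 0.0, 0.5, 1.0)]

ols_slope, ols_intercept = ols_line(points)
quantile_slopes = {tau: quantile_line(points, tau)[0] for tau in (0.1, 0.5, 0.9)}

print(f"OLS slope: {ols_slope:.2f}")
for tau, slope in quantile_slopes.items():
    print(f"Quantile {tau:.1f} slope: {slope:.2f}")
```

On this made-up dataset the OLS slope and the median slope are both 2, while the 10th and 90th percentile slopes are 1 and 3. The single average slope is correct, but it hides the fact that the relationship is three times steeper at the top than at the bottom.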

One of the clearest examples of this comes from Koenker and Hallock’s (2001) discussion of CEO salaries. If we use linear regression to study what explains CEO pay, we find that firm size and profits are positively correlated with salary, but this only describes the average. Quantile regression reveals that these variables have a much stronger effect at the upper end of the pay scale than at the lower end. For instance, an increase in company revenue might lead to a small bump in salary for a mid-level executive, but a much larger raise for the highest-paid CEOs. In this case, relying on the average limits the depth of the conclusions we can draw about wealth inequality. Without quantile regression, we would have overlooked the fact that pay for the most elite executives is widening wealth inequality, and that company performance amplifies inequality in pay at the top. If we relied only on linear regression, we might wrongly conclude that pay is relatively stable and fair across the board, when in fact it is highly skewed in favor of those already at the top. This is why quantile regression is so valuable, especially in social science data where distributions are rarely symmetrical and outcomes like income, health, or education are shaped by inequality. It lets us challenge overly simple conclusions and produce analysis that is more sensitive to real-world variation.

Task 4: When Averages and Visualizations Mislead

Q1: What lesson does Anscombe’s quartet teach about relying only on summary statistics like the mean or R²?

Anscombe’s Quartet demonstrates why relying only on summary statistics such as the mean, variance, correlation coefficient, or R squared can be misleading, especially in social science data. In the example shown by Sparsh Gupta in “Anscombe’s Quartet: What Is It and Why Do We Care?”, we are presented with four datasets that appear nearly identical based on key statistical summaries. All four datasets share the same mean and variance for both the x and y variables, the same linear regression line, and the same R squared value, which measures the proportion of variation in the dependent variable explained by the independent variable. However, when these datasets are visualized through scatterplots, they reveal very different underlying patterns.

The first dataset follows a clear linear pattern and supports the use of linear regression. The second shows a non-linear relationship, where a curved pattern suggests that a linear model would be inappropriate. The third contains an influential outlier that distorts the regression line despite the rest of the data being tightly clustered. The fourth dataset includes nearly all points aligned vertically with one extreme point that artificially creates the illusion of a linear relationship. These examples illustrate that statistical summaries alone cannot capture the structure or complexity of a dataset. This lesson has important implications when working with real-world data, such as analyzing GDP per capita across countries. Suppose a researcher finds a high correlation between education levels and GDP per capita and a strong R squared value from a regression model. If they do not visualize the data, they might overlook whether this relationship holds across all countries or is disproportionately influenced by a few high-income nations. Without plotting the data, the researcher would be unable to detect issues like outliers, regional clustering, or non-linear trends that could invalidate the model’s assumptions.
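The point about identical summaries can be checked directly. The sketch below, in plain Python, computes the summary statistics for the four Anscombe datasets using the standard published values; the helper name `summarize` is my own. It is an illustration of the idea, not a reproduction of the code in Gupta's article.

```python
# Anscombe's quartet: datasets I to III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the summary statistics that are (almost) identical across the quartet."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),  # sample variance
        "var_y": syy / (n - 1),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r_squared": sxy ** 2 / (sxx * syy),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in anscombe.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

Rounded to two decimal places, every row prints the same numbers, even though dataset II is curved, dataset III has one outlier, and dataset IV has a single point that determines the whole slope. Only a plot, or a look at the residuals, separates them.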

In short, the key insight from Anscombe’s Quartet is that statistical summaries can mask important variation in the data. Visualization helps uncover these hidden patterns and ensures that analysis is grounded in the actual distribution and structure of the data rather than only in abstract numerical indicators. Before drawing conclusions or building models, it is essential to explore the data visually to assess its suitability for the intended analysis and to avoid incorrect or oversimplified interpretations. However, it is also important to acknowledge that visualization is not a perfect substitute for statistical analysis. While it is essential for identifying patterns, anomalies, and structural issues in the data, it remains a subjective tool that depends on how the graph is presented and interpreted. Visualizations can sometimes oversimplify complex relationships or create false impressions depending on scale, axes, and grouping. Therefore, visualisation should be seen as one valuable component of a broader analytical process, not as a complete solution in itself.

Q2: Imagine two countries with the same GDP per capita. Give two different ways their economic situations could still be very different (justify your answer).

Two countries with the same GDP per capita can differ dramatically in terms of both economic equity and long-term resilience, making GDP per capita an incomplete and sometimes misleading indicator of a country’s economic reality. Gupta (2022) and Mouschoutzi (2025) emphasize this idea in the context of Anscombe’s Quartet, where identical averages mask entirely different underlying patterns. This helps us understand why countries with the same GDP per capita can be vastly different from one another. Just as the datasets in the Quartet share the same mean and R² but have completely different distributions, countries too can appear similar on paper while differing drastically in the lived experiences of their populations. The core insight is that averages reduce complex distributions to single numbers like the mean and, in doing so, often erase structural differences that are crucial for nuanced, in-depth analysis of a problem, especially in the social sciences. This applies to GDP per capita too: on its own, it leaves open the questions of how wealth is distributed, how it is generated, and whether it can be sustained over time.

One major way two countries with the same GDP per capita can differ is in the level of income inequality. For example, Brazil and Slovenia have both reported comparable GDP (PPP) per capita in recent years, hovering around the $21,000 to $22,000 range according to 2023 estimates. However, their Gini coefficients, which measure income inequality on a scale from 0 (perfect equality) to 1 (perfect inequality), tell a completely different story. Brazil’s Gini coefficient is around 0.53, among the highest in the world, while Slovenia’s is about 0.24, one of the lowest. What this means in practice is that while the average income might look similar on paper, the typical Brazilian is far more likely to live in poverty, have lower access to healthcare and education, and face social instability than someone in Slovenia. This distributional inequality affects everything from life expectancy to crime rates and political trust. GDP per capita as an average gives no indication of these disparities. Theoretically, this is tied to the limits of using the mean as a representative measure when the underlying distribution is skewed or bimodal, which is often the case in unequal societies.
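To show how two distributions with the same mean can produce very different Gini coefficients, here is a small sketch in plain Python. The two income lists are hypothetical numbers chosen for illustration, not data for Brazil or Slovenia, and the function uses one standard discrete formula for the Gini coefficient based on sorted incomes.

```python
def gini(incomes):
    """Gini coefficient from individual incomes (0 = perfect equality)."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted sum: poorer people get negative weights, richer people positive ones.
    weighted = sum((2 * rank - n - 1) * x for rank, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical five-person "countries" with the same average income of 20.
equal_country = [20, 20, 20, 20, 20]
unequal_country = [2, 3, 5, 10, 80]

for name, incomes in (("equal", equal_country), ("unequal", unequal_country)):
    mean_income = sum(incomes) / len(incomes)
    print(f"{name}: mean = {mean_income:.1f}, Gini = {gini(incomes):.3f}")
```

Both lists have a mean of 20, yet the first has a Gini of 0 and the second a Gini of about 0.65. The average alone cannot distinguish between them, which is the same problem GDP per capita has.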

A second fundamental way two countries can differ despite having the same GDP per capita is in the structure and sustainability of their economies. This matters because similar income levels may be generated through very different types of economic activity, some of which are more stable, inclusive, or future-oriented than others. For example, we could consider Mexico and China, which had 2023 nominal GDP per capita figures of roughly $13,700 and $12,500 respectively. At first glance they appear economically comparable, but if we dig deeper, these countries differ significantly in both income distribution and social infrastructure. China has a much stronger record of investing in state-led infrastructure, education, and healthcare through centralized planning. In contrast, Mexico has struggled with high levels of informal labor, unequal access to education, and regional economic disparities. While both may report similar per capita income levels, the distribution of that income and the quality of public services create vastly different experiences for their populations. This mirrors the idea behind the Datasaurus Dozen in Mouschoutzi’s article: datasets can have identical statistical summaries but entirely different internal structures.

In conclusion, GDP per capita can serve as a proxy for economic success and stability, but it should be treated only as a starting point, not the end point, for understanding economic well-being. Two countries might appear statistically similar, but without exploring the distribution of income and the structure of the economy, we risk drawing false equivalencies and promoting ineffective or even harmful policy conclusions.

Q3: How might a regression model using GDP per capita as a dependent variable miss important internal variation within countries?

A regression model that uses GDP per capita as a dependent variable may overlook important internal variation within countries because GDP per capita is an average that compresses a wide range of economic realities into a single summary statistic. While it is commonly used as a proxy for national income or development, it obscures critical differences in how income and opportunity are distributed within a country. Two countries may share a similar GDP per capita, yet the distribution of wealth, access to services, and quality of life may differ drastically across regions, social groups, or urban and rural populations.

For example, in a country like Brazil, which has a relatively high GDP per capita for its region, the national figure can hide extreme income inequality between urban centers like São Paulo and poorer rural regions in the northeast. Even within São Paulo, there is massive wealth inequality between the favela neighbourhoods and the modern commercial districts with their skyscrapers. A regression that uses only national-level GDP per capita may conclude that certain variables, such as education or infrastructure, have a modest overall effect. However, if the model were disaggregated by region or income decile, it might reveal that those same variables have a much larger effect in disadvantaged areas, where marginal improvements can make a significant difference. In other words, the model may fail to detect where policy interventions are most needed because it is averaging over too much heterogeneity. Similarly, if we take India as an example, large cities like Mumbai or Delhi show sharp inequality between wealthy districts and informal settlements. A regression model working with national averages would not detect the ways in which urban poverty, slum housing, and informal labor conditions offset the benefits of growth in GDP per capita. These forms of internal variation are impossible to capture in a model that treats the entire country as a single unit of analysis.
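The averaging problem described above can be sketched with a toy example in plain Python. The two regions and their numbers are entirely hypothetical (they do not represent Brazil or India), and `ols_slope` is my own helper name. The sketch fits one pooled regression and then one regression per region.

```python
def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical districts: x is an education or infrastructure score, y an income index.
regions = {
    "advantaged": [(float(x), 10.0 + 0.5 * x) for x in range(1, 6)],
    "disadvantaged": [(float(x), 2.0 + 3.0 * x) for x in range(1, 6)],
}

pooled_points = [point for pts in regions.values() for point in pts]
pooled_slope = ols_slope(pooled_points)
region_slopes = {name: ols_slope(pts) for name, pts in regions.items()}

print(f"Pooled (national) slope: {pooled_slope:.2f}")
for name, slope in region_slopes.items():
    print(f"{name} slope: {slope:.2f}")
```

In this made-up case the pooled slope is 1.75, which describes neither region: the effect is 0.5 in the advantaged region and 3.0 in the disadvantaged one. A model fitted only at the national level would understate the effect exactly where an intervention would matter most.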

Therefore, regression models that rely solely on national GDP per capita risk missing the deeper inequalities and regional disparities that shape economic realities within countries. Techniques like quantile regression, along with more disaggregated data, offer a richer and more accurate understanding. Without such tools, we risk drawing misleading conclusions that overlook the groups and regions most in need of policy attention, with harmful consequences for them.

Q4: Suggest an alternative or complementary indicator to GDP per capita. How could including this indicator change the story told by the data?

A strong complementary indicator to GDP per capita is the Multidimensional Poverty Index (MPI), developed by the Oxford Poverty and Human Development Initiative, which I read about in a report by the UNDP (United Nations Development Programme). GDP per capita measures economic output per person, whereas the MPI is, in my opinion, much more informative because it looks at deprivations in health, education, and standard of living at the household level. It combines ten weighted indicators across these three dimensions, including child mortality, nutrition, years of schooling, school attendance, sanitation, drinking water, electricity, and housing, to give a more nuanced picture of poverty and inequality within countries. This does not only describe poverty; it also speaks directly to the national economy, because it can show which areas need the most attention, given that an additional $1 of investment may have a larger impact in some regions than in others.

Including the MPI alongside GDP per capita gives a much more well-rounded picture, since we are looking not only at economic output but also at health and education within one index. India is again an interesting example. It had a GDP per capita (in PPP terms, which accounts for cost of living) of around $10,166 in 2023, which might suggest moderate prosperity compared to other developing countries. However, the 2023 MPI report shows a very different reality. It reveals that 16.4 percent of the population still lives in multidimensional poverty, which translates to over 230 million people. Crucially, the MPI report highlights vast regional disparities. For example, the Indian state of Bihar had an MPI of 0.237, one of the highest in the country, whereas Kerala had an MPI of 0.008, one of the lowest. This is a nearly 30x difference (0.237 / 0.008 ≈ 29.6) in the extent and depth of poverty between two regions of the same country, which means we can now identify internal variation within a country. If we only used GDP per capita, these internal inequalities would remain hidden, and policymaking could overlook the regions and communities most in need of support.

Another example is Nigeria, with a GDP per capita (PPP) of around $6,207, relatively close to Ghana at $7,543. But according to the 2023 MPI, Nigeria has a much higher multidimensional poverty rate than Ghana. Even though the two countries are geographically close and economically similar, this suggests that one is investing significantly more in education and healthcare, which is itself an indicator of economic growth. It shows us that even with similar average incomes, the extent and intensity of deprivation can vary widely, especially in access to education, healthcare, and basic infrastructure.

By including the MPI, policymakers and researchers can better understand who is being left behind, allowing for more targeted interventions. It challenges the assumption that rising GDP per capita automatically translates into better living conditions for all. Combining the MPI with GDP per capita brings us closer to understanding the full spectrum of human development and helps prioritize interventions that reach the most deprived.

Q5: Sketch or describe a visualization of country-level GDP data. How could a poorly chosen visualization mislead a policymaker or audience?

A useful and intuitive way to visualize country-level GDP data is through a choropleth world map, in which each country is shaded (let’s say in blue) based on its GDP per capita. Darker shades of blue represent higher values and lighter shades represent lower values. This approach works because we are working with univariate data, meaning that only one variable is being visualised across geographic regions. This allows us to quickly identify spatial patterns, such as how wealth concentrates by region, for example by comparing North America, Europe, and Sub-Saharan Africa. A policymaker can glance at such a map and get a rough sense of global income inequality or regional disparities. Furthermore, if the map is drawn at a sub-national level, the gradient of blue can show internal variation: zooming into a country like China, the shading of its regions would tell us which areas have a higher GDP per capita than others.

However, a poorly chosen visualization could seriously mislead the audience by being confusing or by not showing any internal variation. One example would be a bar chart with country names on the x-axis listed in alphabetical order instead of by value or region. If all the countries are plotted in alphabetical order, the chart becomes visually confusing and practically unreadable. This is a mistake I have made in the past, and I know that such a chart is very difficult to analyse and draw nuanced claims from. This sort of visualisation obscures meaningful patterns, because we cannot make claims about regional groupings or extremes in the data when countries with similar GDPs are placed far apart simply because of their names. A country like Singapore would sit next to Sierra Leone despite having a vastly different GDP per capita, and the proximity of these two countries on the chart tells us nothing meaningful. This would make it difficult for policymakers to detect which countries are performing similarly or differently, and equally hard to compare integrated blocs such as the EU, ASEAN, and the African Union. Overall, this undermines the purpose of the visualization.

Q6: Redesign your visualization so it communicates the uncertainty and internal variation more honestly.

To redesign the GDP per capita visualization in a clearer and more honest way, I would add a second layer of information to show inequality or poverty, not just income levels. A simple world map colored by GDP per capita can be misleading because it only shows the average income, and doesn’t reveal how that income is shared within a country. So instead of just one color per country, I would use a map that includes a second feature like the Gini coefficient (which measures income inequality) or the Multidimensional Poverty Index (which shows how people are deprived in areas like health, education, and housing). For example, I could show GDP per capita with color, and then use patterns like dots or lines to show inequality levels. That way, a country that looks rich at first glance could also be seen as deeply unequal once the pattern is added.

This would help viewers and policymakers avoid making quick or incorrect assumptions based only on averages. It would make the visualization more honest, more thoughtful, and much more useful for actually understanding what is going on in different countries.

Task 5: Reflection

Q1: What principle(s) will you adopt in your own work to keep your analysis honest, transparent, and critical?

One principle I will definitely adopt in my future work is being more honest and cautious when interpreting results, especially when it comes to statistical tests like p-values. When I first learned about statistical significance, I was guilty of treating the p-value as the ultimate answer because I thought that if the number was below 0.05, it meant something was “true,” and if it was above, it meant the opposite. After reading Nuzzo (2014, 2015) and Greenland et al. (2016), I now understand that this mindset is flawed since a p-value doesn’t prove or disprove a claim at all. It just tells us how compatible our data is with the assumption we are testing, and even then, only under specific conditions. Greenland and his co-authors explained how a common mistake is thinking a p-value tells you the probability a hypothesis is true, or that not getting a “significant” p-value means there is no effect at all.

An example of a past interpretation, from a data analytics module, in which I committed a logical fallacy:

In a previous research project, I conducted a comparative analysis to explore perceptions of generosity between North America and Western Europe, aiming to test the stereotype that North Americans are less generous than their Western European counterparts. The research question was: “Does the perception of generosity and its value change across the Atlantic Ocean?” The hypotheses were: H₀: There is no difference in the perception of generosity between North America and Western Europe, and H₁: There is a statistically significant difference.

I applied a number of different methods to test this claim and my tests produced a p-value significantly above 0.05, indicating a failure to reject the null hypothesis. However, in my conclusion I wrote: “this means that the stereotype of North Americans being arrogant and not generous… is false.”

Excerpt from my interpretation:

The clear logical fallacy here is that I accepted the null hypothesis as if it were proven true, rather than simply recognizing that there was insufficient evidence to reject it. Failing to find a statistically significant difference does not confirm that the two groups are identical or that the stereotype is “false”, because it only tells us that our data did not detect a difference. What I should have written is that “The analysis found no statistical evidence of a difference in perceived generosity between North America and Western Europe. However, this does not confirm that no difference exists since further research with larger or more targeted samples may be needed to draw stronger conclusions.” This adjustment would avoid overstating the conclusion and better reflect the limitations of the statistical method I used at the time. Going forward, I will always question whether my analysis reflects the uncertainty in the data and whether I am overstating what I can really claim. I will try to be more transparent in how I communicate results and remember that statistics are tools to help understand complex realities, not simple answers that speak for themselves.
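A small sketch can show why failing to reject H₀ is not the same as showing there is no difference. The example below, in plain Python, runs an exact two-sided permutation test on hypothetical generosity scores. These are invented numbers, not the data from my project, and a permutation test is simply one convenient choice here rather than the method I used at the time. The two groups are built to differ by two points on average, but with only three observations each.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling the pooled scores into two groups.
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        difference = abs(sum(a) / len(a) - sum(b) / len(b))
        total += 1
        if difference >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / total


# Hypothetical scores with a built-in difference of 2 points between the groups.
north_america = [5.0, 6.0, 7.0]
western_europe = [7.0, 8.0, 9.0]

p_value = permutation_p_value(north_america, western_europe)
print(f"p-value: {p_value:.2f}")
```

Here the p-value is 0.20, so the test fails to reject H₀ even though the groups really do differ in this constructed example. With three observations per group, this test can never produce a p-value below 0.10, so a non-significant result says more about the sample size than about the stereotype.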

Q2: Which part of this assignment challenged your thinking the most, and why?

The part of this assignment that challenged my thinking the most was comparing linear regression with quantile regression. I had always seen linear regression as the default way to analyse relationships in data, so it was surprising to learn how much it can miss by only focusing on the average. Quantile regression pushed me to think about how effects can vary across different groups, especially for people at the margins who are often overlooked. At my university we almost always look only at linear regressions in data science, so learning about this really made me question what I had been taught. I already knew that one variable can have a completely different impact at the bottom, middle, and top of a distribution, but I had never questioned why we did not use quantile regression more. It challenged me to think more critically about whose experiences are being represented in the data and how we might be ignoring important variation just because it gets averaged out, which is such a common practice.