Goal 1: Business Scenario
First, create your own context for the lab. This should be a business
use-case such as “a real estate firm aims to present housing trends
(and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem; you only need to define
it.
Your scenario should include the following:
- Customer or Audience: who exactly will use your
results?
- We represent a non-profit taking the place of USAID, and we are
looking for where our services would do the most good. We only have
the funding, time, and personnel to operate in three countries to
start, thanks to a grant we received from a wealthy benefactor. Depending
on our results, we have the opportunity to extend the grant, which we
would use to expand our operations and help more people.
- Problem Statement: reference
this article to help you write a SMART
problem statement.
- Which three countries would give us the highest return on our
investment, measured in years of life expectancy gained per dollar, in
the first five years of our grant?
- Scope: What variables from the data (in the lab)
can address the issue presented in your problem statement? What analyses
would you use? You’ll need to define any assumptions you feel need to be
made before you move forward.
- Our response variable will be life expectancy, and our explanatory
variables will be country and GDP per capita.
- Linear regression, possibly with weighted least squares.
- Assumptions
- GDP per capita is related to life expectancy.
- Life expectancy is a good measure of where people need additional
resources and there is an opportunity to help.
- All else being equal, the country specifically does not have an
effect on life expectancy.
- All else being equal, the continent people live on does have an
impact on their life expectancy.
- Life expectancy is positively correlated with year.
- The United Kingdom's infant mortality is representative of a
developed country.
- Afghanistan's infant mortality is representative of an
under-developed country.
- Objective: Define your success criteria. In other
words, suppose you started working on this problem in earnest; how will
you know when you are done? For example, you might want to “identify the
factors that most influence
<some variable>.”
- Identify the variables that best predict which countries have the
lowest life expectancies, so that we can determine which countries would
benefit most from our charity’s services and support.
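The regression described under Scope can be sketched roughly as follows. This is a minimal proof of concept, not the lab's own model: it assumes the CRAN `gapminder` package is an acceptable stand-in for the lab's data (same column names `lifeExp`, `gdpPercap`, `continent`, `year`, `pop`), and the choice of population as the weight and of a log transform on GDP per capita are illustrative assumptions. Country is left out of the formula, in line with the assumption that it has no effect of its own.

```r
library(gapminder)  # assumption: CRAN gapminder stands in for the lab's data

# Response: life expectancy.
# Predictors: log GDP per capita, continent, and year.
# Weighting by population is an illustrative choice, not the lab's method.
fit_wls <- lm(lifeExp ~ log(gdpPercap) + continent + year,
              data = gapminder,
              weights = pop)

# Coefficient table: estimates, standard errors, t values, p values
summary(fit_wls)$coefficients
```

Dropping the `weights` argument gives the ordinary least squares fit, which makes it easy to see how much the weighting changes the coefficients.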
Goal 2: Model Critique
Since this is a class, and not a workplace, we need to be careful not
to present too much information to you all at once. For this reason, our
labs are often not as analytically rigorous or thorough as they might be
in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would
recommend for your business scenario. Each proposed analysis
should be accompanied by a “proof of concept” R implementation. (As
usual, execute R code blocks here in the RMarkdown
file.)
In the lab your group has been assigned, consider issues with models,
statistical improvements, interpretations, analyses, visualizations,
etc. Use this notebook as a sandbox for trying out different
code, and investigating the data from a different perspective. Take
notes on all the issues you see, and propose your solutions (even if you
might need to request more data or resources to accomplish those
solutions).
You’ll want to consider the following:
Analytical issues, such as the current model assumptions.
- The central limit theorem justifies normal-approximation intervals
for the mean, but it does not apply directly to the median, so interval
estimates for the median need a different justification.
- The data are assumed to be accurate.
- The continent categorization rests on interesting but unexplained
assumptions, such as how Oceania and FSU are handled.
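On the first point, the median can still be given a resampling-based interval. The sketch below is a percentile bootstrap in base R. It assumes the CRAN `gapminder` package as a stand-in for the lab's data; the seed, the number of resamples, and the choice of the most recent year are arbitrary illustrative values.

```r
library(gapminder)  # assumption: CRAN gapminder stands in for the lab's data

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Country-level life expectancy for the most recent year in the data
latest <- gapminder[gapminder$year == max(gapminder$year), ]
x <- latest$lifeExp

# Resample with replacement and recompute the median each time
n_boot <- 2000
boot_medians <- replicate(n_boot, median(sample(x, replace = TRUE)))

# 95% percentile interval for the median
ci_median <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_median
```

Because the sample median only takes values near the observed data points, the bootstrap distribution is lumpy, so the interval should be read as approximate.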
Issues with the data itself.
- Mean life expectancy is used instead of the median, and no
information on infant mortality is included or considered. Infant
mortality is a major factor that skews mean life expectancy.
Because we are predicting mean life expectancy, the response may be
better described by a Poisson distribution, in which case ordinary
least squares would not be the most appropriate fit.
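The data set has no within-country distribution of age at death, so the mean-versus-median concern cannot be checked directly. As a rough proxy, the sketch below compares the mean and median of country-level life expectancy within each continent, which at least shows how far a skewed distribution pulls the two apart. It assumes the CRAN `gapminder` package as a stand-in for the lab's data; the summary column names are made up for illustration.

```r
library(gapminder)  # assumption: CRAN gapminder stands in for the lab's data
library(dplyr)

# Mean and median of country-level life expectancy, per continent,
# for the most recent year in the data
mean_vs_median <- gapminder %>%
  filter(year == max(year)) %>%
  group_by(continent) %>%
  summarise(n_countries    = n(),
            mean_lifeExp   = mean(lifeExp),
            median_lifeExp = median(lifeExp),
            gap            = mean_lifeExp - median_lifeExp,
            .groups = "drop")

mean_vs_median
```

A negative `gap` means a few low values pull the mean below the median, which is the direction of skew the infant mortality argument predicts.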
Statistical improvements; what do we know now that we didn’t know
(or at least didn’t use) then? Are there other methods that would be
appropriate?
- The original analysis relied on confidence intervals, but we are
interested in causality, which would let us better understand which
factors drive our response variable.
Are there better visualizations which could have been used?
- Overall, the visualizations in the notebook suit confidence
intervals, but are less suited to the question we are asking.
Better visualizations for this problem include:
- Life expectancy for varying levels of infant mortality.
- Frequency of each country’s representation in the data set to show
how some countries are over/under represented.
# Assumes `gapminder` was already loaded earlier in the notebook
library(dplyr)
library(ggplot2)

# Infant mortality values assigned by hand for the two representative countries
gapminder$infant <- ifelse(gapminder$country == "United Kingdom", 3.9, NA)
gapminder$infant <- ifelse(gapminder$country == "Afghanistan", 50, gapminder$infant)

# Keep only the two representative countries
gapminder_filter <- gapminder %>%
  filter(country %in% c("United Kingdom", "Afghanistan"))

ggplot(data = gapminder_filter, aes(x = year, y = lifeExp)) +
  geom_point(aes(color = country)) +
  labs(x = "Year",
       y = "Life Expectancy",
       title = "Life Expectancy for Countries with Low and High Infant Mortality")

# One row per country: its continent and its number of records.
# Grouping by both columns keeps summarise() at one row per group.
gap_group <- gapminder %>%
  group_by(country, continent) %>%
  summarise(n_records = n(), .groups = "drop")
# Countries ordered by record count along the x-axis; names hidden for legibility
ggplot(data = gap_group, aes(x = reorder(country, n_records), y = n_records)) +
  geom_point(aes(color = continent)) +
  theme(axis.text.x = element_blank()) +
  labs(x = "",
       y = "Survey Records",
       title = "Number of Observations per Country, Colored by Continent")

Feel free to use the reading for the week associated with your
assigned lab to help refresh your memory on the concepts presented.
Goal 3: Ethical and Epistemological Concerns
Review the materials from the Week 5 lesson on Ethics and
Epistemology. This includes lecture slides, the lecture video, or the
reading. You should also consider doing supplementary research on the
topic at hand (e.g., news outlets, historical articles, etc.). Some
issues you might want to consider include:
- Overcoming biases (existing or potential).
- There is no information in the data set on whether a country
was involved in a (civil) war, epidemic, or other mass casualty event
that could skew its life expectancy.
- There is an assumption that all countries in the data set are
universally recognized with no areas in dispute or under
occupation.
- People living in urban areas may be overrepresented in the data
relative to those in rural areas, because urban data are easier to
collect.
- Possible risks or societal implications.
- If the analysis is wrong, the investment could produce results that
are underwhelming or below expectations.
- Crucial issues which might not be measurable.
- Who would be affected by this project, and how does that affect your
critique?
- Donors and philanthropists who want the most well-being per dollar
donated rely on analyses like this to make informed giving
decisions.
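For the first bias concern above (wars, epidemics, and other mass casualty events), the data cannot say why life expectancy fell, but it can flag where it fell. The sketch below lists the largest drops between consecutive records for each country, as candidates for follow-up research. It assumes the CRAN `gapminder` package as a stand-in for the lab's data, and the `change` column is a name introduced here for illustration.

```r
library(gapminder)  # assumption: CRAN gapminder stands in for the lab's data
library(dplyr)

# Change in life expectancy from each country's previous record
life_drops <- gapminder %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(change = lifeExp - dplyr::lag(lifeExp)) %>%
  ungroup() %>%
  filter(!is.na(change), change < 0) %>%   # keep only declines
  arrange(change) %>%                      # largest drops first
  select(country, continent, year, lifeExp, change)

# Ten largest declines, as candidates for manual review
head(life_drops, 10)
```

Each flagged country-year still needs outside sources (news archives, historical records) to confirm what happened, since a drop alone does not identify its cause.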