Getting Started

Scenario: You are a data analyst for the Star Wars Archives. Your task is to analyze the Star Wars dataset to understand physical traits of characters and present findings clearly to non-technical stakeholders.

Deliverables:

  • Summary statistics for height and mass (mean, median, range)
  • Skewness and kurtosis of mass
  • Visualizations: histogram and boxplot of mass, scatterplot of height vs. mass
  • A concise 3–5 sentence narrative interpreting your results

Setup:

  • Ensure you can access the starwars dataset (available via dplyr).
  • Load any packages you need (e.g., for data wrangling and for skewness/kurtosis).

Part 1: Understand the Data

Objective: Build a quick “data dictionary” of the dataset.

Tasks:

  1. Load the starwars dataset into your R session (via dplyr)
  2. Determine the object type of starwars
  3. Report the number of rows and columns
  4. Explore the structure and identify which variables are numeric vs. categorical
  5. Identify which columns contain missing values
  6. List all variable names
# TODO: Write code to complete the tasks above.
# Add brief inline comments to explain your findings.
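
One possible starting point for the tasks above, assuming dplyr is installed. The object names (`sw`, `num_vars`, `chr_vars`, `na_counts`) are illustrative choices, not part of the assignment.

```r
library(dplyr)  # attaching dplyr makes the starwars dataset available

sw <- starwars  # local copy under a shorter, illustrative name

class(sw)       # object type
dim(sw)         # number of rows, number of columns
glimpse(sw)     # one line per column: name, type, first few values
names(sw)       # all variable names

# Split variables by storage type
num_vars <- names(sw)[sapply(sw, is.numeric)]
chr_vars <- names(sw)[sapply(sw, is.character)]

# Count missing values per column, then keep only columns that have any
na_counts <- sapply(sw, function(col) sum(is.na(col)))
na_counts[na_counts > 0]
```

Counting NAs column by column with `sapply()` also works for the list-type columns in the dataset, which are neither numeric nor character.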

Short Write-up (2–3 sentences): Summarize what this dataset contains and any immediate data quality concerns you notice.

Part 2: Clean and Prepare

Objective: Create a clean analysis-ready subset for height and mass.

Tasks:

  1. Identify any variables stored as factors. In starwars the categorical variables are stored as character vectors rather than factors, so convert at least one (e.g., gender) to a factor and list its levels.
  2. Identify which variables have missing values.
  3. Create a cleaned dataset that excludes rows with missing values for height and mass.
# TODO: Implement factor inspection and missing-value checks.
# TODO: Create your cleaned dataset (e.g., clean_data).
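
A minimal sketch of these steps, assuming a recent dplyr release in which the categorical columns of starwars are character vectors. `factor_vars` and `gender_levels` are illustrative names; `clean_data` follows the name suggested in the TODO.

```r
library(dplyr)

# Columns already stored as factors (expected to be empty if the
# categorical columns are character vectors, as assumed here)
factor_vars <- names(starwars)[sapply(starwars, is.factor)]

# Convert one categorical column to a factor to inspect its levels
gender_levels <- levels(factor(starwars$gender))

# Missing-value counts for the two analysis variables
colSums(is.na(starwars[, c("height", "mass")]))

# Keep only rows where both height and mass are present
clean_data <- starwars %>%
  filter(!is.na(height), !is.na(mass))

# How many rows were dropped
nrow(starwars) - nrow(clean_data)
```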

Short Write-up (1–2 sentences): Explain why you chose to remove the rows you did and how this affects analysis reliability.

Part 3: Summary Statistics

Objective: Report central tendency and spread for height and mass.

Tasks (using your cleaned dataset):

  1. Compute mean, median, and range for mass.
  2. Compute mean, median, and range for height.
  3. Briefly interpret: Do mean and median differ meaningfully? What does the range suggest about spread?
# TODO: Calculate mean, median, and range for mass and height.
# TODO: Store results in clearly named objects or a small summary table.
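
One way to collect the statistics in a small table, sketched with base R functions. The helper `summarise_var()` and the object `summary_tbl` are hypothetical names introduced for illustration; the range is reported as separate min and max columns.

```r
library(dplyr)

# Cleaned subset, rebuilt here so the example runs on its own
clean_data <- starwars %>%
  filter(!is.na(height), !is.na(mass))

# Mean, median, and range (as min and max) for one numeric vector
summarise_var <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}

# One row per variable, one column per statistic
summary_tbl <- rbind(
  height = summarise_var(clean_data$height),
  mass   = summarise_var(clean_data$mass)
)

round(summary_tbl, 1)
```

A large gap between mean and median in a row of this table is a first hint of skew or outliers, before any plot is drawn.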

Short Write-up (2–3 sentences): Compare mean vs. median for each variable and comment on potential outliers or skew based on these numbers alone.

Part 4: Shape of the Data

Objective: Quantify distribution shape for mass.

Tasks:

  1. Compute skewness for mass.
  2. Compute kurtosis for mass.
  3. Interpret the values: Is the distribution symmetrical or skewed? Are the tails heavier or lighter than normal?
# TODO: Compute skewness and kurtosis for mass.
# TODO: Add a short comment about what the values imply.
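
The sketch below computes moment-based skewness and kurtosis by hand so that no extra package is needed. The helpers `skewness_m()` and `kurtosis_m()` are hypothetical names. Packages such as moments or e1071 offer ready-made functions, but their default definitions may differ slightly, so check which one you are reporting.

```r
library(dplyr)

# Cleaned subset, rebuilt here so the example runs on its own
clean_data <- starwars %>%
  filter(!is.na(height), !is.na(mass))

# Skewness: third central moment divided by variance^(3/2)
skewness_m <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Kurtosis (not excess): fourth central moment divided by variance^2.
# A normal distribution has a value of 3 under this definition.
kurtosis_m <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

mass_skew <- skewness_m(clean_data$mass)
mass_kurt <- kurtosis_m(clean_data$mass)

# Positive skewness: longer right tail. Kurtosis above 3: heavier tails
# than a normal distribution, so extreme values are more likely.
c(skewness = mass_skew, kurtosis = mass_kurt)
```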

Short Write-up (2–3 sentences): Explain what the skewness and kurtosis imply about how representative the mean is and whether outliers are likely.

Part 5: Visualize Patterns & Outliers

Objective: Create plots that reveal distribution, outliers, and relationships.

Tasks:

  1. Create a histogram of mass and explain what it shows about the distribution.
  2. Create a boxplot of mass and discuss outliers.
  3. Create a scatterplot of height vs. mass and describe the relationship (if any).
# TODO: Produce a histogram of mass.
# TODO: Produce a boxplot of mass.
# TODO: Produce a scatterplot of height vs. mass.
# Add concise captions/comments beneath each plot.
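
The three plots can be drafted with base R graphics, as sketched below; ggplot2 would work equally well. The axis units (cm, kg) follow the dplyr documentation for starwars, and `mass_hist` and `mass_outliers` are illustrative names.

```r
library(dplyr)

# Cleaned subset, rebuilt here so the example runs on its own
clean_data <- starwars %>%
  filter(!is.na(height), !is.na(mass))

# Histogram: overall shape of the mass distribution
mass_hist <- hist(clean_data$mass, breaks = 30,
                  main = "Distribution of mass", xlab = "Mass (kg)")

# Boxplot: points beyond the whiskers are flagged as potential outliers
boxplot(clean_data$mass, horizontal = TRUE,
        main = "Boxplot of mass", xlab = "Mass (kg)")

# The same values the boxplot draws as individual points
mass_outliers <- boxplot.stats(clean_data$mass)$out

# Scatterplot: relationship between height and mass
plot(clean_data$height, clean_data$mass,
     main = "Height vs. mass",
     xlab = "Height (cm)", ylab = "Mass (kg)")
```

Looking up the rows whose mass appears in `mass_outliers` shows which characters drive the extreme values.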

Short Write-up (3–4 sentences): Which plot was most helpful to identify outliers? What patterns (if any) do you observe between height and mass?

Part 6: Final Briefing (Your Deliverable)

Objective: Communicate findings clearly to non-technical stakeholders.

Include the following:

  • A compact table or bullet list with mean, median, range for height and mass (based on your cleaned dataset).
  • A statement on whether the mass distribution is skewed and how you know (reference skewness/kurtosis and plots).
  • At least two plots that support your conclusions (reuse or improve from Part 5).
  • A 3–5 sentence narrative that synthesizes your findings in plain language (no jargon).
# TODO: Assemble your final summary objects/plots here so they're visible together.
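
One way to show the summary table and two supporting plots together is sketched below. It assumes the same cleaning rule as Part 2, and `briefing_tbl` is a hypothetical name.

```r
library(dplyr)

# Cleaned subset, rebuilt here so the example runs on its own
clean_data <- starwars %>%
  filter(!is.na(height), !is.na(mass))

# Compact table for stakeholders: one row per variable
briefing_tbl <- data.frame(
  variable = c("height (cm)", "mass (kg)"),
  mean     = round(c(mean(clean_data$height), mean(clean_data$mass)), 1),
  median   = c(median(clean_data$height), median(clean_data$mass)),
  min      = c(min(clean_data$height), min(clean_data$mass)),
  max      = c(max(clean_data$height), max(clean_data$mass))
)
print(briefing_tbl)

# Two plots side by side; restore the previous layout afterwards
old_par <- par(mfrow = c(1, 2))
boxplot(clean_data$mass, main = "Mass (kg)")
plot(clean_data$height, clean_data$mass,
     main = "Height vs. mass",
     xlab = "Height (cm)", ylab = "Mass (kg)")
par(old_par)
```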

Reflection (Answer Briefly)

  1. What was easier: computing statistics or interpreting them? Why?
  2. How would your analysis change if you kept rows with missing values?
  3. If you had to predict the height of a new character, what other variables might you use and why?

Submission Checklist