Getting Started

Scenario: You are a data analyst for the Star Wars Archives. Your task is to analyze the Star Wars dataset to understand physical traits of characters and present findings clearly to non-technical stakeholders.

Deliverables:

Summary statistics for height and mass (mean, median, range)
Skewness and kurtosis of mass, with an interpretation
Visualizations: histogram and boxplot of mass, scatterplot of height vs. mass
A concise 3–5 sentence narrative interpreting your results

Setup:

Ensure you can access the starwars dataset (available via dplyr).
Load any packages you need (e.g., for data wrangling and for skewness/kurtosis).

Part 1: Understand the Data

Objective: Build a quick “data dictionary” of the dataset.

Tasks:

Load the starwars dataset into your R session (via dplyr)
Determine the object type of starwars
Report the number of rows and columns
Explore the structure and identify which variables are numeric vs. categorical
Identify which columns contain missing values
List all variable names

# TODO: Write code to complete the tasks above.
# Add brief inline comments to explain your findings.
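
The inspection tasks above can be sketched in base R. This is a minimal illustration on a toy data frame (the object `toy` and all of its values are made up, not taken from starwars), so it shows the calls without giving away your findings. The same functions should work on starwars once dplyr is loaded.

```r
# Toy stand-in for starwars: hypothetical values, for illustration only
toy <- data.frame(
  name   = c("A", "B", "C", "D", "E"),
  height = c(172L, 167L, 96L, NA, 228L),
  mass   = c(77, 75, 32, 120, NA),
  gender = c("masculine", "feminine", "masculine", NA, "masculine"),
  stringsAsFactors = FALSE
)

class(toy)   # object type
dim(toy)     # number of rows, then number of columns
str(toy)     # compact view of each column's type and first values
names(toy)   # all variable names

# Which columns are numeric, and which are not?
numeric_vars     <- names(toy)[sapply(toy, is.numeric)]
categorical_vars <- setdiff(names(toy), numeric_vars)

# Count missing values per column and keep the columns that have any
na_counts    <- colSums(is.na(toy))
cols_with_na <- names(na_counts)[na_counts > 0]
```

If you prefer a tidyverse view of the structure, `glimpse()` from dplyr is an alternative to `str()`.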

Short Write-up (2–3 sentences): Summarize what this dataset contains and any immediate data quality concerns you notice.

Part 2: Clean and Prepare

Objective: Create a clean analysis-ready subset for height and mass.

Tasks:

Identify the categorical variables. In current dplyr releases these are stored as character vectors rather than factors, so convert at least one (e.g., gender) with factor() and list its levels.
Identify which variables have missing values.
Create a cleaned dataset that excludes every row with a missing value in height or mass.

# TODO: Implement factor inspection and missing-value checks.
# TODO: Create your cleaned dataset (e.g., clean_data).
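
One possible shape for these steps, again on a made-up toy data frame rather than starwars. The name `clean_data` follows the TODO above. The conversion with `factor()` is an assumption about how you choose to handle character columns.

```r
# Toy stand-in for starwars: hypothetical values, for illustration only
toy <- data.frame(
  name   = c("A", "B", "C", "D", "E"),
  height = c(172L, 167L, 96L, NA, 228L),
  mass   = c(77, 75, 32, 120, NA),
  gender = c("masculine", "feminine", "masculine", NA, "masculine"),
  stringsAsFactors = FALSE
)

# Convert a character column to a factor, then inspect its levels
toy$gender  <- factor(toy$gender)
factor_vars <- names(toy)[sapply(toy, is.factor)]
levels(toy$gender)

# Keep only rows where BOTH height and mass are present
keep       <- !is.na(toy$height) & !is.na(toy$mass)
clean_data <- toy[keep, ]

# How many rows were dropped? Worth reporting in your write-up.
rows_dropped <- nrow(toy) - nrow(clean_data)
```

With dplyr loaded, `filter(toy, !is.na(height), !is.na(mass))` expresses the same row selection.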

Short Write-up (1–2 sentences): Explain why you chose to remove the rows you did and how this affects analysis reliability.

Part 3: Summary Statistics

Objective: Report central tendency and spread for height and mass.

Tasks (using your cleaned dataset):

Compute mean, median, and range for mass.
Compute mean, median, and range for height.
Briefly interpret: Do mean and median differ meaningfully? What does the range suggest about spread?

# TODO: Calculate mean, median, and range for mass and height.
# TODO: Store results in clearly named objects or a small summary table.
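
A small summary table is one way to keep the results together. The sketch below uses made-up numbers and a hypothetical helper, `summarise_var()`. Swap in your own cleaned dataset.

```r
# Hypothetical cleaned data: values are invented for illustration
clean_data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  mass   = c(60, 70, 80, 90, 500)
)

# Helper (hypothetical name) returning the four numbers for one variable
summarise_var <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}

# One row per variable, one column per statistic
summary_table <- rbind(
  height = summarise_var(clean_data$height),
  mass   = summarise_var(clean_data$mass)
)
summary_table

# range() returns the minimum and maximum as a length-2 vector
range(clean_data$mass)
```

In this toy example the mean of mass (160) sits far above its median (80) because of a single large value, which is the kind of gap the interpretation task asks you to look for.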

Short Write-up (2–3 sentences): Compare mean vs. median for each variable and comment on potential outliers or skew based on these numbers alone.

Part 4: Shape of the Data

Objective: Quantify distribution shape for mass.

Tasks:

Compute skewness for mass.
Compute kurtosis for mass.
Interpret the values: Is the distribution symmetrical or skewed? Are the tails heavier or lighter than normal?

# TODO: Compute skewness and kurtosis for mass.
# TODO: Add a short comment about what the values imply.
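
If you want to see what a skewness or kurtosis function computes, the moment-based definitions are short enough to write by hand. The function names below are hypothetical, the data are made up, and this is only one of several conventions. Some packages report excess kurtosis (the value below minus 3), so check the documentation of whichever package you load.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (Pearson) kurtosis: fourth central moment over variance^2.
# Under this convention a normal distribution has kurtosis 3.
kurtosis_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Invented values with one large observation on the right
mass <- c(60, 70, 80, 90, 500)

mass_skew <- skewness_moment(mass)  # positive: long right tail
mass_kurt <- kurtosis_moment(mass)  # above 3: heavier tails than normal

# A perfectly symmetric sample has zero skewness
symmetric_skew <- skewness_moment(c(1, 2, 3, 4, 5))
```

A positive skewness means the mean is pulled above the median, so the median is usually the more representative summary.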

Short Write-up (2–3 sentences): Explain what the skewness and kurtosis imply about how representative the mean is and whether outliers are likely.

Part 5: Visualize Patterns & Outliers

Objective: Create plots that reveal distribution, outliers, and relationships.

Tasks:

Create a histogram of mass and explain what it shows about the distribution.
Create a boxplot of mass and discuss outliers.
Create a scatterplot of height vs. mass and describe the relationship (if any).

# TODO: Produce a histogram of mass.
# TODO: Produce a boxplot of mass.
# TODO: Produce a scatterplot of height vs. mass.
# Add concise captions/comments beneath each plot.
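
The three plots can be drawn with base R graphics alone. The sketch below uses invented values. ggplot2 is an equally valid choice if you have it loaded. `boxplot.stats()` is included because it returns the points a default boxplot draws as outliers, which is handy when you discuss them.

```r
# Hypothetical cleaned data: values are invented for illustration
clean_data <- data.frame(
  height = c(150, 160, 170, 180, 190),
  mass   = c(60, 70, 80, 90, 500)
)

# Histogram: overall shape of the mass distribution
hist(clean_data$mass,
     main = "Distribution of mass",
     xlab = "Mass (kg)")

# Boxplot: median, quartiles, and points flagged as outliers
boxplot(clean_data$mass,
        horizontal = TRUE,
        main = "Boxplot of mass",
        xlab = "Mass (kg)")

# The values a default boxplot flags: those lying more than
# 1.5 box lengths beyond either end of the box
mass_outliers <- boxplot.stats(clean_data$mass)$out

# Scatterplot: relationship between height and mass
plot(clean_data$height, clean_data$mass,
     main = "Height vs. mass",
     xlab = "Height (cm)",
     ylab = "Mass (kg)")
```

When one extreme value compresses the rest of the scatterplot, consider drawing a second version without it and saying so in the caption.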

Short Write-up (3–4 sentences): Which plot was most helpful to identify outliers? What patterns (if any) do you observe between height and mass?

Part 6: Final Briefing (Your Deliverable)

Objective: Communicate findings clearly to non-technical stakeholders.

Include the following:

A compact table or bullet list with mean, median, range for height and mass (based on your cleaned dataset).
A statement on whether the mass distribution is skewed and how you know (reference skewness/kurtosis and plots).
At least two plots that support your conclusions (reuse or improve from Part 5).
A 3–5 sentence narrative that synthesizes your findings in plain language (no jargon).

# TODO: Assemble your final summary objects/plots here so they're visible together.

Reflection (Answer Briefly)

What was easier: computing statistics or interpreting them? Why?
How would your analysis change if you kept rows with missing values?
If you had to predict the height of a new character, what other variables might you use and why?

Submission Checklist

Cleaned dataset created and used for all calculations/plots.
Mean, median, range for height and mass reported.
Skewness and kurtosis for mass computed and interpreted.
Histogram, boxplot, and scatterplot created and explained.
Final 3–5 sentence narrative provided.
Code is readable with brief comments.
Deliverables saved to an R script.
BONUS: Cleaned dataset exported to a file type that non-R users can open (e.g., CSV).
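
For the bonus item, a CSV export is one option. Note that starwars includes list columns (films, vehicles, starships), and `write.csv()` cannot write those directly. The sketch below drops list columns before exporting. The toy data, the file name, and the use of `tempdir()` are all assumptions for illustration.

```r
# Hypothetical cleaned data with one list column, mimicking starwars
clean_data <- data.frame(
  name   = c("A", "B"),
  height = c(150, 160),
  mass   = c(60, 70)
)
clean_data$films <- list(c("F1", "F2"), "F3")

# Drop list columns, which a flat CSV cannot represent
is_list_col <- sapply(clean_data, is.list)
flat_data   <- clean_data[, !is_list_col, drop = FALSE]

# Write to a temporary location; use your project folder in practice
out_file <- file.path(tempdir(), "clean_data.csv")
write.csv(flat_data, out_file, row.names = FALSE)

# Read it back to confirm the export worked
check <- read.csv(out_file)
```

If the list columns matter to your audience, collapsing each one into a single delimited string is an alternative to dropping it.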

Applied Data Science Project: Star Wars Character Analysis

Instructor: Darth Algorithmus

September 06, 2025
