Lecture 1 - Course Introduction and Types of Data

Penelope Pooler Eisenbies
MAS 261

2023-08-28

Housekeeping

Introduction 👋
Syllabus 📃
Navigating Slides ⬇️
Today’s plan 📋
- Navigating R and RStudio 🪄
- General Introduction to Statistics and Analytics 📈
- Population vs. Sample 🧮
- Types of Data 🧮

A little about me… 👋

I grew up here and went to SU, and then I …
- worked in Scotland, Slovakia, Lithuania, Chile.
- traveled all over…
- went to graduate school in Oregon and Virginia.
- worked in federal gov’t and private sector.
Now I …
- still do consulting and some research in analytics.
- also helped create the Business Analytics major here at Whitman.

Introduction to R and RStudio 🪄

You have two options to facilitate your introduction to R and RStudio:
- Option 1: Create a Posit Cloud account, and download and install R and RStudio on your laptop.
- Option 2: Start with a free Posit Cloud account, and later transition to using R/RStudio on your laptop.
If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.
- We will use Posit Cloud for Quizzes.
If you are nervous about coding: Choose Option 2.
For both options: I can help with download/install issues during office hours.
What I do: I maintain a Posit Cloud account for helping students, but I do most of my work on my laptop.
NOTE: We will use R and RStudio in class on Thursday (8/31).
- You can use either Posit Cloud or your laptop.

What is Statistics, Analytics, etc.? 📈

Statistics (the discipline) allows us to answer questions about (almost) anything we want to know that we can collect data for.

We start with a POPULATION that we have a question about.
We select a subset from that population, called a SAMPLE.
We collect data from the SAMPLE and summarize it, to get an ESTIMATE.
That ESTIMATE answers our question about our POPULATION.
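
The four steps above can be sketched in R. This is only a small simulation, assuming a made-up population: the 10,000 people, the 30% rate, the sample size of 500, and the names `population`, `sample_data`, and `estimate` are all invented for illustration.

```r
# Hypothetical POPULATION: 10,000 people, about 30% of whom have a degree.
# All numbers here are invented for illustration.
set.seed(261)
population <- rbinom(10000, size = 1, prob = 0.30)

# SAMPLE: a random subset of 500 people from the population
sample_data <- sample(population, size = 500)

# ESTIMATE: the sample proportion, used to answer the question
# about the whole population
estimate <- mean(sample_data)

# In a real study the population value is unknown. We can compute it
# here only because the population is simulated.
true_value <- mean(population)

estimate
true_value
```

The estimate will usually be close to, but not exactly equal to, the population value.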

Statistics, Analytics, Data Science…huh ❓

Statistics is a discipline (what we study) and a statistic is an estimate based on data.
- This dual meaning confuses people.
- A newer term to describe the discipline of organizing, analyzing, and presenting data is Analytics.
How do these terms, Statistics and Analytics, differ?
- It depends on who you are talking to.
- Analytics (like the Business Analytics major at Whitman) is the more modern term.
- Another (overlapping) term is Data Science which is similar but more encompassing.
Why so many terms? Good Question!
My Opinion: People are unsure about this field and its methods which leads to confusion.

Where do YOU fit in to Statistics and Analytics ❓

Data, statistics, and analytics are essential to everyday life.
- This is especially true for management and business professionals.
- Understanding the pandemic, politics, sports, weather, and investments requires data skills.
Where you (students) are needed:
- You are needed to understand data and communicate statistical information CORRECTLY, HONESTLY, AND ETHICALLY to your peers, and the world at large.
Before you ask…
- Yes, you could google "How do I communicate statistics?"
- Yes, you could ask ChatGPT a related question.
- This class will provide tools and information that those searches don’t provide.
- More to come on how to use ChatGPT effectively and ethically.

First Two Terms: Population vs. Sample

What percent of people in each county in the United States have a Bachelor’s degree?

The POPULATION is all people in the USA.
Within that population, we have a SUBPOPULATION for each county.
To answer this question, should we talk to EVERY person in every county?
- NO! Instead, the American Community Survey uses an established sample design to obtain representative data from each county.
- The SAMPLE is the group of people SELECTED in each county who complete the survey.
- The ESTIMATES from each county’s SAMPLE represent that county’s POPULATION.
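
The county idea can be sketched in R with one sample and one estimate per subpopulation. This is not the American Community Survey's actual sample design. It assumes three made-up counties, invented degree rates, and a simple random sample of 200 people within each county. The names `county_pop`, `county_sample`, and `county_estimates` are hypothetical.

```r
# Hypothetical data: 3 made-up counties with invented degree rates.
set.seed(2023)
county_pop <- data.frame(
  county = rep(c("County A", "County B", "County C"),
               times = c(5000, 3000, 2000)),
  has_degree = c(rbinom(5000, 1, 0.25),
                 rbinom(3000, 1, 0.35),
                 rbinom(2000, 1, 0.45))
)

# SAMPLE: randomly select 200 people within each county (subpopulation)
sampled_rows <- unlist(lapply(
  split(seq_len(nrow(county_pop)), county_pop$county),
  function(rows) sample(rows, size = 200)
))
county_sample <- county_pop[sampled_rows, ]

# ESTIMATE for each county: percent of sampled people with a degree
county_estimates <- 100 * tapply(county_sample$has_degree,
                                 county_sample$county, mean)
round(county_estimates, 1)
```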

💥 Lecture 1 In-class Exercises - Q1 💥

Session ID: mas261f23

You are writing an article about freshman students at SU to find out how far they are from home and how long it took them to travel to campus.

You randomly select 100 freshman students and ask them some questions to collect your data.

The POPULATION of interest is

A. All students at SU

B. All freshman students at SU

C. All freshman students in the USA

D. The 100 students you selected

💥 Lecture 1 In-class Exercises - Q2 💥

Session ID: mas261f23

You are writing an article about freshman students at SU to find out how far they are from home and how long it took them to travel to campus.

You randomly select 100 freshman students and ask them some questions to collect your data.

The SAMPLE is

A. All students at SU

B. All freshman students at SU

C. All freshman students in the USA

D. The 100 students you selected

🧮 Types of Data - Components of Dataset 🧮

Within a dataset there are different TERMS for each part of the dataset, e.g., the rows, columns, and values.

Each COLUMN is a Variable
Column Labels are Variable Names
Each ROW is an Observation
Individual CELLS (VALUES) are Data or Data Values

Obs_No	Manufacturer	Model	Body_Style	Num_Gears	City_MPG
1	Ford	GT	coupe	7	11
2	Ferrari	458 Speciale	coupe	7	13
3	Ferrari	458 Spider	convertible	7	13
4	Ferrari	458 Italia	coupe	7	13
5	Ferrari	488 GTB	coupe	7	15
6	Ferrari	California	convertible	7	16
7	Ferrari	GTC4Lusso	coupe	7	12
8	Ferrari	FF	coupe	7	11
9	Ferrari	F12Berlinetta	coupe	7	11
10	Ferrari	LaFerrari	coupe	7	12
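
As a rough sketch of how these terms map onto R, the first three rows of the table can be typed into a data frame by hand. The name `car_data` is an assumption made for this example. The functions used are all base R.

```r
# First three rows of the car table above, typed in by hand.
# `car_data` is a hypothetical name chosen for this sketch.
car_data <- data.frame(
  Obs_No = 1:3,
  Manufacturer = c("Ford", "Ferrari", "Ferrari"),
  Model = c("GT", "458 Speciale", "458 Spider"),
  Body_Style = c("coupe", "coupe", "convertible"),
  Num_Gears = c(7, 7, 7),
  City_MPG = c(11, 13, 13),
  stringsAsFactors = FALSE
)

names(car_data)       # Variable Names (column labels)
ncol(car_data)        # number of Variables (columns)
nrow(car_data)        # number of Observations (rows)
car_data[2, ]         # one Observation (row 2)
car_data$City_MPG     # one Variable (column)
car_data[2, "Model"]  # one Data Value (a single cell)
```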

🧮 Types of Variables in a Dataset 🧮

There are four main types of data:
- Categorical Nominal
- Categorical Ordinal
- Quantitative Discrete
- Quantitative Continuous

🧮 Types of Variables in a Dataset 🧮

Look at BOTH the data definitions (Data Dictionary) and the data values to determine data type.

Variable	Definition
Obs_No	Row ID Number
Manufacturer	Name of Manufacturer
Model	Car Model
Body_Style	Style of Body (4 categories)
Num_Gears	Number of Gears
City_MPG	Average City Miles/Gallon

Obs_No	Manufacturer	Model	Body_Style	Num_Gears	City_MPG
1	Ford	GT	coupe	7	11
2	Ferrari	458 Speciale	coupe	7	13
3	Ferrari	458 Spider	convertible	7	13
4	Ferrari	458 Italia	coupe	7	13
5	Ferrari	488 GTB	coupe	7	15
6	Ferrari	California	convertible	7	16
7	Ferrari	GTC4Lusso	coupe	7	12
8	Ferrari	FF	coupe	7	11
9	Ferrari	F12Berlinetta	coupe	7	11
10	Ferrari	LaFerrari	coupe	7	12

🧮 Types of Variables in a Dataset 🧮

Variable	Definition	Variable Type	Comment
Obs_No	Row ID Number	Categorical Nominal	Numbers are row names
Manufacturer	Name of Manufacturer	Categorical Nominal
Model	Car Model	Categorical Nominal
Body_Style	Style of Body (4 categories)	Categorical Nominal
Num_Gears	Number of Gears	Quantitative Discrete OR Categorical Ordinal	Only a few values
City_MPG	Average City Miles/Gallon	Quantitative Continuous

🧮 How does R define these variables? 🧮

# A tibble: 3 × 6
  Obs_No Manufacturer Model        Body_Style  Num_Gears City_MPG
   <int> <chr>        <chr>        <chr>       <fct>        <dbl>
1      1 Ford         GT           coupe       7               11
2      2 Ferrari      458 Speciale coupe       7               13
3      3 Ferrari      458 Spider   convertible 7               13

Type_in_R	Type_Definition	gt_car_Variables	Comment
int	Integer	Obs_No	Row name (ID) ALWAYS Categorical Nominal
chr	Character	Manufacturer, Model, Body_Style	All 3 are Categorical Nominal
fct	Factor	Num_Gears	Could be Quantitative Discrete or Categorical Ordinal
dbl	Double Precision (Decimal)	City_MPG	Quantitative Continuous even if decimals not shown
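
One way to check these types yourself is sketched below in base R. It rebuilds the three rows shown above by hand under the assumed name `car_types`. The lecture output is a tibble, while this sketch uses a plain data frame so it needs no extra packages.

```r
# Three rows from the output above, with each column given the same type.
# `car_types` is a hypothetical name chosen for this sketch.
car_types <- data.frame(
  Obs_No = 1:3,                                    # int
  Manufacturer = c("Ford", "Ferrari", "Ferrari"),  # chr
  Model = c("GT", "458 Speciale", "458 Spider"),   # chr
  Body_Style = c("coupe", "coupe", "convertible"), # chr
  Num_Gears = factor(c(7, 7, 7)),                  # fct
  City_MPG = c(11, 13, 13),                        # dbl
  stringsAsFactors = FALSE
)

# class() reports "numeric" for a dbl column
sapply(car_types, class)

# typeof() shows the underlying storage type, "double"
typeof(car_types$City_MPG)

# A factor stores its categories as levels
levels(car_types$Num_Gears)
```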

Why define Num_Gears as Categorical Ordinal?

If we treat Num_Gears as Categorical, then we can create an informative plot.
The left plot shows the average City Miles per Gallon for each body style and number of gears.
The right plot is not useful because the Num_Gears values are treated as integers instead of categories.
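
The same idea can be sketched without a plot, using the ten rows from the car table earlier in the lecture. The vector names below are assumptions made for this example. Note that all ten rows shown have 7 gears, so only one gear category appears here.

```r
# Values copied from the ten rows of the car table shown earlier.
# The vector names are hypothetical, chosen for this sketch.
body_style <- c("coupe", "coupe", "convertible", "coupe", "coupe",
                "convertible", "coupe", "coupe", "coupe", "coupe")
num_gears <- rep(7, 10)
city_mpg <- c(11, 13, 13, 13, 15, 16, 12, 11, 11, 12)

# Treated as a number, Num_Gears is summarized like a measurement
mean(num_gears)

# Treated as a factor, each distinct value becomes a category (level)
num_gears_fct <- factor(num_gears)
levels(num_gears_fct)
table(num_gears_fct)

# Average City MPG for each body style and gear category
avg_mpg <- tapply(city_mpg, list(body_style, num_gears_fct), mean)
avg_mpg
```

With the factor version, each combination of body style and number of gears gets its own group average, which is the kind of summary the left plot displays.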

Another Data Example - Country Demographics and Labor Information

Country	Income_Level	Year	Population	Urban_Pop_Pct	Labor_Force
Lesotho	Lower middle income	2018	2006756	0.34	674904
United States	High income	2018	316740705	0.83	150798813
Angola	Lower middle income	2018	29783592	0.61	10738475
Albania	Upper middle income	2019	2862098	NA	1408749
Argentina	High income	2021	28922467	1.00	13078900

💥 Lecture 1 In-class Exercises - Q3 💥

Session ID: mas261f23

Which variable in the Country data set is categorical and ordinal?

A. Country

B. Income_Level

C. Year

D. Population

E. Urban_Pop_Pct

# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

💥 Lecture 1 In-class Exercises - Q4 💥

Session ID: mas261f23

Which variable in the Country data set is categorical and nominal?

A. Country

B. Income_Level

C. Year

D. Population

E. Urban_Pop_Pct

# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

💥 Lecture 1 In-class Exercises - Q5 💥

Session ID: mas261f23

Which variable in the Country data set is quantitative and continuous?

A. Country

B. Income_Level

C. Year

D. Population

E. Urban_Pop_Pct

# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

💥 Lecture 1 In-class Exercises - Q6 💥

Session ID: mas261f23

How many variables in this dataset are quantitative and discrete?

A. 0

B. 1

C. 2

D. 3

E. 4

# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

Country Variables

Type_in_R	Type_Definition	Variable	Comment
chr	Character	Country	Names are ALWAYS Categorical Nominal
chr	Character	Income_Level	Categorical Ordinal
dbl	Decimal	Year	Year is Quantitative Discrete. Data collected annually. Convert to Integer.
dbl	Decimal	Population	Population is Quantitative Discrete. Convert to Integer.
dbl	Decimal	Urban_Pop_Pct	Urban_Pop_Pct is a percentage, which is ALWAYS Quantitative Continuous.
dbl	Decimal	Labor_Force	Labor_Force is Quantitative Discrete. Convert to Integer.
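
The conversions suggested in the comments could be done roughly as follows. This is a sketch, not the code used to make the slides. The data frame name `countries` and the ordering in `income_order` are assumptions. The ordering follows the World Bank income groups and includes "Low income", which does not appear in the five rows shown.

```r
# The five rows from the country table, typed in by hand.
# `countries` is a hypothetical name chosen for this sketch.
countries <- data.frame(
  Country = c("Lesotho", "United States", "Angola", "Albania", "Argentina"),
  Income_Level = c("Lower middle income", "High income",
                   "Lower middle income", "Upper middle income",
                   "High income"),
  Year = c(2018, 2018, 2018, 2019, 2021),
  Population = c(2006756, 316740705, 29783592, 2862098, 28922467),
  Urban_Pop_Pct = c(0.34, 0.83, 0.61, NA, 1.00),
  Labor_Force = c(674904, 150798813, 10738475, 1408749, 13078900),
  stringsAsFactors = FALSE
)

# Quantitative Discrete variables: convert from decimal (dbl) to integer
countries$Year <- as.integer(countries$Year)
countries$Population <- as.integer(countries$Population)
countries$Labor_Force <- as.integer(countries$Labor_Force)

# Categorical Ordinal variable: store as an ordered factor.
# ASSUMPTION: this ordering follows the World Bank income groups.
income_order <- c("Low income", "Lower middle income",
                  "Upper middle income", "High income")
countries$Income_Level <- factor(countries$Income_Level,
                                 levels = income_order, ordered = TRUE)

# Check the first class of each column after converting
sapply(countries, function(x) class(x)[1])

# Ordered categories can be compared: is Lesotho's level below the USA's?
countries$Income_Level[1] < countries$Income_Level[2]
```

Urban_Pop_Pct stays as a decimal because it is Quantitative Continuous.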

Key Points from Today

Different terms for parts of a data set
- Columns are Variables
- Rows are Observations
- Individual values are Data
Four main types of variables
- Categorical Nominal (e.g., names, ID values)
- Categorical Ordinal (e.g., letter grades, quality ratings)
- Quantitative Discrete (e.g., number of pets, years of education)
- Quantitative Continuous (e.g., credit card balance, height, weight)

To submit an Engagement Question or Comment about material from Lecture 1: Submit it by midnight today (day of lecture).

Click on the link next to the ❓ under Lecture 1.