Housekeeping

  • Introduction 👋

  • Syllabus 📃

  • Navigating Slides ⬇️

  • Today’s plan 📋

    • Navigating R and RStudio 🪄

    • General Introduction to Statistics and Analytics 📈

    • Population vs. Sample 🧮

    • Types of Data 🧮

A little about me… 👋

I grew up here and went to SU and

  • then I …

    • worked in Scotland, Slovakia, Lithuania, Chile.
    • traveled all over…
    • went to graduate school in Oregon and Virginia.
    • worked in federal gov’t and private sector.
  • Now I …

    • still do consulting and some research in analytics.
    • also helped create the Business Analytics major here at Whitman.

Introduction to R and RStudio 🪄

  • In this course we will use R and RStudio to understand statistical concepts.

  • You will access R and RStudio through Posit Cloud.

  • I will post R/RStudio files on Posit Cloud that you can access in provided links.

  • I will also provide demo videos that show how to access files and complete exercises.

  • NOTE: The free Posit Cloud account is limited to 25 hours per month.

    • I will demo how to download completed work so that you can use this allotment efficiently.

    • For those who want to go further with R/RStudio:

      • After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

What is Statistics, Analytics, etc.? 📈

  • Statistics (the discipline) allows us to answer questions about (almost) anything we want to know that we can collect data for.
  • We start with a POPULATION that we have a question about.

  • We select a subset from that population, called a SAMPLE.

  • We collect data from the SAMPLE and summarize it, to get an ESTIMATE.

  • That ESTIMATE answers our question about our POPULATION.
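The POPULATION → SAMPLE → ESTIMATE steps above can be sketched in a few lines of R. This is only an illustration: the population is simulated, and the 35% degree rate, the population size, and the sample size of 200 are all made-up assumptions.

```r
# Hypothetical POPULATION: 10,000 simulated people.
# TRUE means the person has a bachelor's degree (the 35% rate is an assumption).
set.seed(1)
population <- rbinom(10000, size = 1, prob = 0.35) == 1

# SAMPLE: a random subset of 200 people from the population.
sample_rows <- sample(length(population), size = 200)
sample_data <- population[sample_rows]

# ESTIMATE: the proportion with a degree in the sample.
estimate <- mean(sample_data)

# The true value is known here only because the population is simulated.
truth <- mean(population)

cat("Estimate from sample: ", estimate, "\n")
cat("True population value:", truth, "\n")
```

In real work we never see `truth`; the estimate from the sample is what we report.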

Statistics, Analytics, Data Science…huh

  • Statistics is a discipline (what we study) and a statistic is an estimate based on data.

    • This dual meaning confuses people.

    • A newer term to describe the discipline of organizing, analyzing, and presenting data is Analytics.

  • How do these terms, Statistics and Analytics, differ?

    • It depends on who you are talking to.
    • Analytics (like the Business Analytics major at Whitman) is the more modern term.
    • Another (overlapping) term is Data Science, which is similar but more encompassing.
  • Why so many terms? Good Question!

  • What I Suggest: Allow clients, colleagues, employers to use terms they are comfortable with.

Where do YOU fit into Statistics and Analytics?

  • Data, statistics, and analytics are essential to everyday life.

    • This is especially true for management and business professionals.
    • Understanding the pandemic, politics, sports, weather, and investments all requires data skills.
  • Where you (students) are needed:

    • You are needed to understand data and communicate statistical information CORRECTLY, HONESTLY, AND ETHICALLY to your peers and the world at large.
  • You could google “How do I communicate statistics?”

  • You could write a similar question in your AI tool of choice.

  • This class will provide tools and information that those searches don’t provide.

    • More to come on how to use AI effectively and ethically in this course.

First Two Terms: Population vs. Sample

What percent of people in each county in the United States have a Bachelor’s degree?

  • The POPULATION is all people in the USA.

  • Within that population, we have a SUBPOPULATION for each county.

  • To answer this question, should we talk to EVERY person in every county?

    • NO! Instead, the American Community Survey uses an established sample design to attain representative data from each county.

    • The SAMPLE is the group of people SELECTED in each county who complete the survey.

    • The ESTIMATES from each county’s SAMPLE represent that county’s POPULATION.
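A rough sketch of the same idea with subpopulations: sample within each county, then compute one estimate per county. The three counties, their degree rates, and the sample size of 300 are hypothetical, and this simple random draw is not the American Community Survey's actual sample design.

```r
set.seed(2)

# Hypothetical POPULATION: three made-up counties with 5,000 people each.
# has_degree is 1 if the person has a bachelor's degree, 0 otherwise.
people <- data.frame(
  county = rep(c("County A", "County B", "County C"), each = 5000),
  has_degree = c(rbinom(5000, 1, 0.20),
                 rbinom(5000, 1, 0.35),
                 rbinom(5000, 1, 0.50))
)

# SAMPLE: randomly select 300 people within each county (SUBPOPULATION).
rows_by_county <- split(seq_len(nrow(people)), people$county)
sampled_rows <- unlist(lapply(rows_by_county, sample, size = 300))
survey <- people[sampled_rows, ]

# ESTIMATES: percent with a degree in each county's sample.
estimates <- tapply(survey$has_degree, survey$county, mean) * 100
print(round(estimates, 1))
```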

Bachelor’s Degrees by County

  • What percent of people in each county of the United States have a Bachelor’s degree?

💥Lecture 1 In-class Exercises - Q1 💥

Poll Everywhere

You are writing an article about freshman students at SU to find out how far they are from home and how long it took them to travel to campus.

You randomly select 100 freshman students and ask them some questions to collect your data.

The POPULATION of interest is

A. All students at SU

B. All freshman students at SU

C. All freshman students in the USA

D. The 100 students you selected

💥Lecture 1 In-class Exercises - Q2 💥

Poll Everywhere

You are writing an article about freshman students at SU to find out how far they are from home and how long it took them to travel to campus.

You randomly select 100 freshman students and ask them some questions to collect your data.

The SAMPLE is

A. All students at SU

B. All freshman students at SU

C. All freshman students in the USA

D. The 100 students you selected

Components of Dataset

Within a dataset there are different TERMS for each part of the dataset, e.g. the rows, columns, and values.

  • Each COLUMN is a VARIABLE.

  • COLUMN LABELS are VARIABLE NAMES.

  • Each ROW is an OBSERVATION.

  • Individual CELLS (VALUES) are DATA or DATA VALUES.

Obs_No  Manufacturer  Model          Body_Style   Num_Gears  City_MPG
1       Ford          GT             coupe        7          11
2       Ferrari       458 Speciale   coupe        7          13
3       Ferrari       458 Spider     convertible  7          13
4       Ferrari       458 Italia     coupe        7          13
5       Ferrari       488 GTB        coupe        7          15
6       Ferrari       California     convertible  7          16
7       Ferrari       GTC4Lusso      coupe        7          12
8       Ferrari       FF             coupe        7          11
9       Ferrari       F12Berlinetta  coupe        7          11
10      Ferrari       LaFerrari      coupe        7          12
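As a small illustration of these terms, the first three rows of the car table can be typed into an R data frame and inspected. The name `car_data` is just a placeholder chosen for this sketch.

```r
# First three rows of the car table, typed in by hand.
car_data <- data.frame(
  Obs_No       = 1:3,
  Manufacturer = c("Ford", "Ferrari", "Ferrari"),
  Model        = c("GT", "458 Speciale", "458 Spider"),
  Body_Style   = c("coupe", "coupe", "convertible"),
  Num_Gears    = c(7, 7, 7),
  City_MPG     = c(11, 13, 13),
  stringsAsFactors = FALSE
)

names(car_data)        # VARIABLE NAMES (the column labels)
ncol(car_data)         # number of VARIABLES (columns)
nrow(car_data)         # number of OBSERVATIONS (rows)
car_data[2, ]          # one OBSERVATION: the second row
car_data$City_MPG      # one VARIABLE: the City_MPG column
car_data[2, "Model"]   # one DATA VALUE: a single cell
```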

Types of Variables in a Dataset

There are four main types of data:

  • Categorical Nominal

  • Categorical Ordinal

  • Quantitative Discrete

  • Quantitative Continuous

Types of Variables in a Dataset

Look at BOTH the data definitions (Data Dictionary) and the data values to determine data type.

Variable      Definition
Obs_No        Row ID Number
Manufacturer  Name of Manufacturer
Model         Car Model
Body_Style    Style of Body (4 categories)
Num_Gears     Number of Gears
City_MPG      Average City Miles/Gallon

Obs_No  Manufacturer  Model          Body_Style   Num_Gears  City_MPG
1       Ford          GT             coupe        7          11
2       Ferrari       458 Speciale   coupe        7          13
3       Ferrari       458 Spider     convertible  7          13
4       Ferrari       458 Italia     coupe        7          13
5       Ferrari       488 GTB        coupe        7          15
6       Ferrari       California     convertible  7          16
7       Ferrari       GTC4Lusso      coupe        7          12
8       Ferrari       FF             coupe        7          11
9       Ferrari       F12Berlinetta  coupe        7          11
10      Ferrari       LaFerrari      coupe        7          12

Types of Variables in a Dataset

Variable      Definition                    Variable Type                                  Comment
Obs_No        Row ID Number                 Categorical Nominal                            Numbers are row names
Manufacturer  Name of Manufacturer          Categorical Nominal
Model         Car Model                     Categorical Nominal
Body_Style    Style of Body (4 categories)  Categorical Nominal
Num_Gears     Number of Gears               Quantitative Discrete OR Categorical Ordinal   Only a few values
City_MPG      Average City Miles/Gallon     Quantitative Continuous

How does R define these variables 🧮

# A tibble: 3 × 6
  Obs_No Manufacturer Model        Body_Style  Num_Gears City_MPG
   <int> <chr>        <chr>        <chr>       <fct>        <dbl>
1      1 Ford         GT           coupe       7               11
2      2 Ferrari      458 Speciale coupe       7               13
3      3 Ferrari      458 Spider   convertible 7               13


Type_in_R  Type_Definition             gt_car_Variables                 Comment
int        Integer                     Obs_No                           Row name (ID) ALWAYS Categorical Nominal
chr        Character                   Manufacturer, Model, Body_Style  All 3 are Categorical Nominal
fct        Factor                      Num_Gears                        Could be Quantitative Discrete or Categorical Ordinal
dbl        Double Precision (Decimal)  City_MPG                         Quantitative Continuous even if decimals not shown
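A minimal sketch of how to check and change these types yourself, using base R and a hand-typed copy of the first three rows. The name `car_data` is a placeholder. Note that base R's `class()` reports double-precision columns as "numeric", while the tibble printout abbreviates them as `<dbl>`.

```r
# First three rows of the car data, typed in by hand.
car_data <- data.frame(
  Obs_No       = 1:3,                                   # integer
  Manufacturer = c("Ford", "Ferrari", "Ferrari"),       # character
  Model        = c("GT", "458 Speciale", "458 Spider"), # character
  Body_Style   = c("coupe", "coupe", "convertible"),    # character
  Num_Gears    = c(7, 7, 7),                            # stored as numbers at first
  City_MPG     = c(11, 13, 13),                         # double precision
  stringsAsFactors = FALSE
)

# Show how R currently stores each variable.
sapply(car_data, class)

# Treat Num_Gears as categorical by converting it to a factor.
car_data$Num_Gears <- factor(car_data$Num_Gears)

# Num_Gears is now a factor; its levels are the distinct gear counts.
class(car_data$Num_Gears)
levels(car_data$Num_Gears)
```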

Why define Num_Gears as Categorical Ordinal?

  • If we treat Num_Gears as Categorical, then we can create an informative plot.

  • The left plot shows average City Miles per Gallon for each body style and number of gears.

  • The right plot is not useful because Num_Gears data are treated as integers.
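The group averages behind a plot like the left one can be sketched as follows. This uses only the ten rows shown earlier, all of which have 7 gears, so it illustrates the grouping step and does not reproduce the plots from the slides. `car_data` and `avg_mpg` are placeholder names.

```r
# Body style, number of gears, and city MPG for the ten rows shown earlier.
car_data <- data.frame(
  Body_Style = c("coupe", "coupe", "convertible", "coupe", "coupe",
                 "convertible", "coupe", "coupe", "coupe", "coupe"),
  Num_Gears  = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7),
  City_MPG   = c(11, 13, 13, 13, 15, 16, 12, 11, 11, 12),
  stringsAsFactors = FALSE
)

# Treat Num_Gears as a category (factor) so each gear count is its own group.
car_data$Num_Gears <- factor(car_data$Num_Gears)

# Average City_MPG for each combination of body style and number of gears.
avg_mpg <- aggregate(City_MPG ~ Body_Style + Num_Gears,
                     data = car_data, FUN = mean)
print(avg_mpg)
```

With the full dataset there would be one average for each body style and gear count, which is what a grouped bar plot displays.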

Another Data Example

Country Demographics and Labor Information

Country        Income_Level         Year  Population  Urban_Pop_Pct  Labor_Force
Lesotho        Lower middle income  2018  2006756     0.34           674904
United States  High income          2018  316740705   0.83           150798813
Angola         Lower middle income  2018  29783592    0.61           10738475
Albania        Upper middle income  2019  2862098     NA             1408749
Argentina      High income          2021  28922467    1.00           13078900

💥Lecture 1 In-class Exercises - Q3 💥

Poll Everywhere

Which variable in the Country data set is categorical and ordinal?

A. Country

B. Income Level

C. Year

D. Population

E. Urban_Pop_Pct


# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

💥Lecture 1 In-class Exercises - Q4 💥

Poll Everywhere

Which variable in the Country data set is categorical and nominal?


A. Country

B. Income Level

C. Year

D. Population

E. Urban_Pop_Pct


# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

💥Lecture 1 In-class Exercises - Q5 💥

Poll Everywhere

Which variable in the Country data set is quantitative and continuous?


A. Country

B. Income Level

C. Year

D. Population

E. Urban_Pop_Pct


# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

💥Lecture 1 In-class Exercises - Q6 💥

Poll Everywhere

How many variables in this dataset are quantitative and discrete?


A. 0

B. 1

C. 2

D. 3

E. 4


# A tibble: 3 × 6
# Groups:   Country [3]
  Country       Income_Level         Year Population Urban_Pop_Pct Labor_Force
  <chr>         <chr>               <dbl>      <dbl>         <dbl>       <dbl>
1 Lesotho       Lower middle income  2018    2006756          0.34      674904
2 United States High income          2018  316740705          0.83   150798813
3 Angola        Lower middle income  2018   29783592          0.61    10738475

Country Variables

Type_in_R  Type_Definition  Variable       Comment
chr        Character        Country        Names are ALWAYS Categorical Nominal
chr        Character        Income_Level   Categorical Ordinal
dbl        Decimal          Year           Year is Quantitative Discrete. Data collected annually. Convert to Integer.
dbl        Decimal          Population     Population is Quantitative Discrete. Convert to Integer.
dbl        Decimal          Urban_Pop_Pct  Urban_Pop_Pct is a percentage, which is ALWAYS Quantitative Continuous
dbl        Decimal          Labor_Force    Labor_Force is Quantitative Discrete. Convert to Integer.
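The conversions suggested in the table can be sketched in base R as follows. The data frame holds the three rows from the tibble printouts, typed in by hand. The ordering of income levels, including a "Low income" level that does not appear in the data shown, is an assumption made for this sketch.

```r
# Three rows of the Country data, typed in by hand.
countries <- data.frame(
  Country       = c("Lesotho", "United States", "Angola"),
  Income_Level  = c("Lower middle income", "High income", "Lower middle income"),
  Year          = c(2018, 2018, 2018),
  Population    = c(2006756, 316740705, 29783592),
  Urban_Pop_Pct = c(0.34, 0.83, 0.61),
  Labor_Force   = c(674904, 150798813, 10738475),
  stringsAsFactors = FALSE
)

# Quantitative Discrete variables: convert whole-number values to integers.
countries$Year        <- as.integer(countries$Year)
countries$Population  <- as.integer(countries$Population)
countries$Labor_Force <- as.integer(countries$Labor_Force)

# Categorical Ordinal variable: an ordered factor.
# ASSUMPTION: this low-to-high ordering, including "Low income".
income_order <- c("Low income", "Lower middle income",
                  "Upper middle income", "High income")
countries$Income_Level <- factor(countries$Income_Level,
                                 levels = income_order, ordered = TRUE)

# Show the first class label of each variable after conversion.
sapply(countries, function(x) class(x)[1])

# Ordered factors can be compared: Lesotho's level is below the United States'.
countries$Income_Level[1] < countries$Income_Level[2]
```

Urban_Pop_Pct stays as a decimal (double) because it is Quantitative Continuous.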