Course Introduction and Types of Data
2025-08-11
Introduction 👋
Syllabus 📃
Navigating Slides ⬇️
Today’s plan 📋
Navigating R and RStudio 🪄
General Introduction to Statistics and Analytics 📈
Population vs. Sample 🧮
Types of Data 🧮
I grew up here and went to SU and
then I …
Now I …
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
We start with a POPULATION that we have a question about.
We select a subset from that population, called a SAMPLE.
We collect data from the SAMPLE and summarize it, to get an ESTIMATE.
That ESTIMATE answers our question about our POPULATION.
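The four steps above can be sketched in R with simulated data. Everything in this sketch is an assumption made for illustration: the population of 5,000 values, the sample size of 100, and the object names are all made up and do not come from the course files.

```r
# Simulated POPULATION: 5,000 made-up measurements (hypothetical data).
set.seed(2025)                          # make the random draws reproducible
population <- rexp(5000, rate = 1 / 3)

# SAMPLE: 100 values selected at random, without replacement.
sample_values <- sample(population, size = 100)

# ESTIMATE: the sample mean, used to answer a question about the POPULATION.
estimate <- mean(sample_values)

# In a real study the population mean is unknown; with simulated data
# we can compute it and see how close the estimate is.
population_mean <- mean(population)
c(estimate = estimate, population_mean = population_mean)
```

The estimate will usually be close to, but not exactly equal to, the population value, because it is based on only part of the population.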
Statistics is a discipline (what we study) and a statistic is an estimate based on data.
This dual meaning confuses people.
A newer term to describe the discipline of organizing, analyzing, and presenting data is Analytics.
How do these terms, Statistics and Analytics, differ?
Why so many terms? Good Question!
What I Suggest: Allow clients, colleagues, employers to use terms they are comfortable with.
Data, statistics, and analytics are essential to everyday life.
Where you (students) are needed:
You could google "How do I communicate statistics?"
You could write a similar question in your AI tool of choice.
This class will provide tools and information that those searches don’t provide.
What percent of people in each county in the United States have a Bachelor’s degree?
The POPULATION is all people in the USA.
Within that population, we have a SUBPOPULATION for each county.
To answer this question, should we talk to EVERY person in every county?
NO! Instead, the American Community Survey uses an established sample design to obtain representative data from each county.
The SAMPLE is the group of people SELECTED in each county who complete the survey.
The ESTIMATES from each county’s SAMPLE represent that county’s POPULATION.
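The idea of sampling within each subpopulation can be sketched in R as follows. This is only an illustration with simulated data: the three county names, their sizes, the degree rates, and the sample size of 200 are all made up, and a simple random sample within each county stands in for the survey's actual, more complex design.

```r
set.seed(11)

# Hypothetical POPULATION: three made-up counties (SUBPOPULATIONS).
# bachelors is 1 if a person has a Bachelor's degree, 0 otherwise.
people <- data.frame(
  county    = rep(c("County A", "County B", "County C"),
                  times = c(4000, 2500, 1500)),
  bachelors = c(rbinom(4000, 1, 0.35),
                rbinom(2500, 1, 0.22),
                rbinom(1500, 1, 0.48))
)

# SAMPLE: randomly select 200 people within each county.
rows_by_county <- split(seq_len(nrow(people)), people$county)
sampled_rows <- unlist(lapply(rows_by_county,
                              function(idx) sample(idx, size = 200)))
survey <- people[sampled_rows, ]

# ESTIMATE: percent with a Bachelor's degree in each county's sample.
estimates <- tapply(survey$bachelors, survey$county,
                    function(x) 100 * mean(x))
round(estimates, 1)
```

Each county gets its own estimate, computed only from the people sampled in that county.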
You are writing an article about freshman students at SU to find out how far they are from home and how long it took them to travel to campus.
You randomly select 100 freshman students and ask them some questions to collect your data.
The POPULATION of interest is
A. All students at SU
B. All freshman students at SU
C. All freshman students in the USA
D. The 100 students you selected
You are writing an article about freshman students at SU to find out how far they are from home and how long it took them to travel to campus.
You randomly select 100 freshman students and ask them some questions to collect your data.
The SAMPLE is
A. All students at SU
B. All freshman students at SU
C. All freshman students in the USA
D. The 100 students you selected
Within a dataset there are different TERMS for each part of the dataset, e.g., the rows, columns, and values.
Each COLUMN is a VARIABLE.
COLUMN LABELS are VARIABLE NAMES.
Each ROW is an OBSERVATION.
Individual CELLS (VALUES) are DATA or DATA VALUES.
Obs_No | Manufacturer | Model | Body_Style | Num_Gears | City_MPG |
---|---|---|---|---|---|
1 | Ford | GT | coupe | 7 | 11 |
2 | Ferrari | 458 Speciale | coupe | 7 | 13 |
3 | Ferrari | 458 Spider | convertible | 7 | 13 |
4 | Ferrari | 458 Italia | coupe | 7 | 13 |
5 | Ferrari | 488 GTB | coupe | 7 | 15 |
6 | Ferrari | California | convertible | 7 | 16 |
7 | Ferrari | GTC4Lusso | coupe | 7 | 12 |
8 | Ferrari | FF | coupe | 7 | 11 |
9 | Ferrari | F12Berlinetta | coupe | 7 | 11 |
10 | Ferrari | LaFerrari | coupe | 7 | 12 |
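The terms above map onto R commands roughly as follows. This sketch rebuilds only the first three rows of the table as a base R data frame, so no packages are needed. The object name gt_cars is an assumption chosen for illustration, not necessarily the name used in the course files.

```r
# First three rows of the table, typed in by hand.
# The name gt_cars is hypothetical.
gt_cars <- data.frame(
  Obs_No       = 1:3,
  Manufacturer = c("Ford", "Ferrari", "Ferrari"),
  Model        = c("GT", "458 Speciale", "458 Spider"),
  Body_Style   = c("coupe", "coupe", "convertible"),
  Num_Gears    = c(7, 7, 7),
  City_MPG     = c(11, 13, 13),
  stringsAsFactors = FALSE
)

names(gt_cars)          # VARIABLE NAMES (column labels)
ncol(gt_cars)           # number of VARIABLES (columns)
nrow(gt_cars)           # number of OBSERVATIONS (rows)
gt_cars[2, ]            # one OBSERVATION: the second row
gt_cars$City_MPG        # one VARIABLE: the City_MPG column
gt_cars[2, "City_MPG"]  # one DATA VALUE: row 2 of City_MPG
```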
There are four main types of data: Categorical Nominal, Categorical Ordinal, Quantitative Discrete, and Quantitative Continuous.
Look at BOTH the data definitions (Data Dictionary) and the data values to determine data type.
Variable | Definition |
---|---|
Obs_No | Row ID Number |
Manufacturer | Name of Manufacturer |
Model | Car Model |
Body_Style | Style of Body (4 categories) |
Num_Gears | Number of Gears |
City_MPG | Average City Miles/Gallon |
Obs_No | Manufacturer | Model | Body_Style | Num_Gears | City_MPG |
---|---|---|---|---|---|
1 | Ford | GT | coupe | 7 | 11 |
2 | Ferrari | 458 Speciale | coupe | 7 | 13 |
3 | Ferrari | 458 Spider | convertible | 7 | 13 |
4 | Ferrari | 458 Italia | coupe | 7 | 13 |
5 | Ferrari | 488 GTB | coupe | 7 | 15 |
6 | Ferrari | California | convertible | 7 | 16 |
7 | Ferrari | GTC4Lusso | coupe | 7 | 12 |
8 | Ferrari | FF | coupe | 7 | 11 |
9 | Ferrari | F12Berlinetta | coupe | 7 | 11 |
10 | Ferrari | LaFerrari | coupe | 7 | 12 |
Variable | Definition | Variable Type | Comment |
---|---|---|---|
Obs_No | Row ID Number | Categorical Nominal | Numbers are row names |
Manufacturer | Name of Manufacturer | Categorical Nominal | |
Model | Car Model | Categorical Nominal | |
Body_Style | Style of Body (4 categories) | Categorical Nominal | |
Num_Gears | Number of Gears | Quantitative Discrete OR Categorical Ordinal | Only a few values |
City_MPG | Average City Miles/Gallon | Quantitative Continuous | |
# A tibble: 3 × 6
Obs_No Manufacturer Model Body_Style Num_Gears City_MPG
<int> <chr> <chr> <chr> <fct> <dbl>
1 1 Ford GT coupe 7 11
2 2 Ferrari 458 Speciale coupe 7 13
3 3 Ferrari 458 Spider convertible 7 13
Type_in_R | Type_Definition | gt_car_Variables | Comment |
---|---|---|---|
int | Integer | Obs_No | Row name (ID) is ALWAYS Categorical Nominal |
chr | Character | Manufacturer, Model, Body_Style | All 3 are Categorical Nominal |
fct | Factor | Num_Gears | Could be Quantitative Discrete or Categorical Ordinal |
dbl | Double Precision (Decimal) | City_MPG | Quantitative Continuous even if decimals not shown |
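One way to check how R stores each variable is sketched below, using a hand-typed base R data frame rather than a tibble. The object names gt_cars, col_classes, and gears_num are assumptions for illustration. Note that base R's class() reports "numeric" for columns that a tibble labels as dbl.

```r
# First three rows of the table; Num_Gears is stored as a factor,
# matching the <fct> label in the tibble output. Names are hypothetical.
gt_cars <- data.frame(
  Obs_No       = 1:3,
  Manufacturer = c("Ford", "Ferrari", "Ferrari"),
  Model        = c("GT", "458 Speciale", "458 Spider"),
  Body_Style   = c("coupe", "coupe", "convertible"),
  Num_Gears    = factor(c(7, 7, 7)),
  City_MPG     = c(11, 13, 13),
  stringsAsFactors = FALSE
)

# How R stores each column.
col_classes <- sapply(gt_cars, class)
col_classes

# Obs_No is stored as an integer but acts as a row ID,
# so it can be converted to a label.
gt_cars$Obs_No <- as.character(gt_cars$Obs_No)

# To use a factor's values as numbers, convert through character first.
# Calling as.numeric() directly on a factor returns internal level codes.
gears_num <- as.numeric(as.character(gt_cars$Num_Gears))
gears_num
```

The storage type R reports is a starting point only. The data dictionary and the values themselves still decide whether a variable is categorical or quantitative.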
If we treat Num_Gears as Categorical, then we can create an informative plot.
The left plot shows average City Miles per Gallon for each body style and number of gears.
The right plot is not useful because Num_Gears data are treated as integers.
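A grouped summary of the kind described for the left plot can be sketched with base R's aggregate(). This uses only the ten rows shown in the table earlier, all of which have 7 gears, so it is an illustration of the technique and not a reproduction of the slide's plot. The names cars10 and avg_mpg are assumptions.

```r
# Body_Style, Num_Gears, and City_MPG for the ten rows shown in the table.
# Num_Gears is stored as a factor so it is treated as a category.
cars10 <- data.frame(
  Body_Style = c("coupe", "coupe", "convertible", "coupe", "coupe",
                 "convertible", "coupe", "coupe", "coupe", "coupe"),
  Num_Gears  = factor(rep(7, 10)),
  City_MPG   = c(11, 13, 13, 13, 15, 16, 12, 11, 11, 12),
  stringsAsFactors = FALSE
)

# Average City MPG for each combination of body style and number of gears.
avg_mpg <- aggregate(City_MPG ~ Body_Style + Num_Gears,
                     data = cars10, FUN = mean)
avg_mpg
```

With Num_Gears as a factor, each gear count becomes its own group. A dataset with several different gear counts would produce one row per body style and gear count combination.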
Country Demographics and Labor Information
Country | Income_Level | Year | Population | Urban_Pop_Pct | Labor_Force |
---|---|---|---|---|---|
Lesotho | Lower middle income | 2018 | 2006756 | 0.34 | 674904 |
United States | High income | 2018 | 316740705 | 0.83 | 150798813 |
Angola | Lower middle income | 2018 | 29783592 | 0.61 | 10738475 |
Albania | Upper middle income | 2019 | 2862098 | NA | 1408749 |
Argentina | High income | 2021 | 28922467 | 1.00 | 13078900 |
Which variable in the Country data set is categorical and ordinal?
A. Country
B. Income Level
C. Year
D. Population
E. Urban_Pop_Pct
# A tibble: 3 × 6
# Groups: Country [3]
Country Income_Level Year Population Urban_Pop_Pct Labor_Force
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Lesotho Lower middle income 2018 2006756 0.34 674904
2 United States High income 2018 316740705 0.83 150798813
3 Angola Lower middle income 2018 29783592 0.61 10738475
Which variable in the Country data set is categorical and nominal?
A. Country
B. Income Level
C. Year
D. Population
E. Urban_Pop_Pct
# A tibble: 3 × 6
# Groups: Country [3]
Country Income_Level Year Population Urban_Pop_Pct Labor_Force
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Lesotho Lower middle income 2018 2006756 0.34 674904
2 United States High income 2018 316740705 0.83 150798813
3 Angola Lower middle income 2018 29783592 0.61 10738475
Which variable in the Country data set is quantitative and continuous?
A. Country
B. Income Level
C. Year
D. Population
E. Urban_Pop_Pct
# A tibble: 3 × 6
# Groups: Country [3]
Country Income_Level Year Population Urban_Pop_Pct Labor_Force
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Lesotho Lower middle income 2018 2006756 0.34 674904
2 United States High income 2018 316740705 0.83 150798813
3 Angola Lower middle income 2018 29783592 0.61 10738475
How many variables in this dataset are quantitative and discrete?
A. 0
B. 1
C. 2
D. 3
E. 4
# A tibble: 3 × 6
# Groups: Country [3]
Country Income_Level Year Population Urban_Pop_Pct Labor_Force
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Lesotho Lower middle income 2018 2006756 0.34 674904
2 United States High income 2018 316740705 0.83 150798813
3 Angola Lower middle income 2018 29783592 0.61 10738475
Type_in_R | Type_Definition | Variable | Comment |
---|---|---|---|
chr | Character | Country | Names are ALWAYS Categorical Nominal |
chr | Character | Income_Level | Categorical Ordinal |
dbl | Decimal | Year | Year is Quantitative Discrete. Data collected annually. Convert to Integer |
dbl | Decimal | Population | Population is Quantitative Discrete. Convert to Integer. |
dbl | Decimal | Urban_Pop_Pct | Urban_Pop_Pct is a percentage, which is ALWAYS Quantitative Continuous |
dbl | Decimal | Labor_Force | Labor_Force is Quantitative Discrete. Convert to Integer. |
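The conversions suggested in the Comment column could look like the sketch below. It uses a hand-typed base R data frame containing the three rows from the tibble output. The object name countries and the choice of an ordered factor for Income_Level are assumptions for illustration, and the level order uses only the three income levels that appear in the table.

```r
# Three rows from the Country table, typed in by hand (name is hypothetical).
countries <- data.frame(
  Country       = c("Lesotho", "United States", "Angola"),
  Income_Level  = c("Lower middle income", "High income",
                    "Lower middle income"),
  Year          = c(2018, 2018, 2018),
  Population    = c(2006756, 316740705, 29783592),
  Urban_Pop_Pct = c(0.34, 0.83, 0.61),
  Labor_Force   = c(674904, 150798813, 10738475),
  stringsAsFactors = FALSE
)

# Quantitative Discrete variables: store whole numbers as integers.
countries$Year        <- as.integer(countries$Year)
countries$Population  <- as.integer(countries$Population)
countries$Labor_Force <- as.integer(countries$Labor_Force)

# Categorical Ordinal variable: store as an ordered factor.
income_order <- c("Lower middle income", "Upper middle income", "High income")
countries$Income_Level <- factor(countries$Income_Level,
                                 levels = income_order, ordered = TRUE)

# First class of each column after conversion.
sapply(countries, function(col) class(col)[1])
```

Urban_Pop_Pct is left as a decimal because it is Quantitative Continuous.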
Different terms for parts of a data set
4 Main Types of variables
Engagement Questions or Comments about material from Lecture 1 must be submitted by midnight today (day of lecture).