Spring 2018

What is Statistics?

  • The science of collecting, organizing, summarizing, analyzing, and interpreting data.

  • Goals of MAT 1140
    • how to interpret data, infer facts, and make predictions and decisions
    • in other words, make you an informed user of numerical information
  • Applications of statistics are everywhere: business, finance, engineering, health science, social science, environmental science, politics, education, and so on.

Learning Goals

  1. Descriptive Statistics
    • Describing and summarizing data
  2. Relationship between variables
    • Estimating and interpreting regression models
  3. Probability
    • Understanding and quantifying randomness
  4. Inference
    • Making conclusions based on data from random samples

Data

  • Data is a collection of facts, such as values or measurements; more formally, it is a set of values of qualitative or quantitative variables.

  • Variables
    • A variable is a characteristic whose value can differ from individual to individual.
  • Examples
    • Age
    • Sex
    • Race/ethnicity
    • Height
    • Weight
    • Education
    • Income
    • Marital Status
    • Number of children

Structure of a Data File

person_id   age   sex   income   education
1           25    M     35,000   AA
2           30    F     48,000   BA
3           22    F     25,000   AA
4           28    M     30,000   BA
5           35    M     48,000   MA
6           41    F     60,000   PhD
.           .     .     .        .
1000        20    F     20,000   GED


The dataset has 1,000 person records or cases. A case is also called a unit of observation or an observational unit.
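
One way to picture this layout in code is the minimal Python sketch below. It assumes each case is stored as a dictionary whose keys are the variable names, and it includes only the seven rows printed in the table (the elided rows are not reproduced). The names `cases`, `variables`, and `ages` are illustrative and do not come from the course materials.

```python
# Each case (row) is one person; each key is a variable (column).
# Only the seven rows printed in the table are included; the elided
# rows between person 6 and person 1000 are not reproduced.
cases = [
    {"person_id": 1, "age": 25, "sex": "M", "income": 35000, "education": "AA"},
    {"person_id": 2, "age": 30, "sex": "F", "income": 48000, "education": "BA"},
    {"person_id": 3, "age": 22, "sex": "F", "income": 25000, "education": "AA"},
    {"person_id": 4, "age": 28, "sex": "M", "income": 30000, "education": "BA"},
    {"person_id": 5, "age": 35, "sex": "M", "income": 48000, "education": "MA"},
    {"person_id": 6, "age": 41, "sex": "F", "income": 60000, "education": "PhD"},
    {"person_id": 1000, "age": 20, "sex": "F", "income": 20000, "education": "GED"},
]

# The variables are the column names shared by every case.
variables = list(cases[0].keys())

# Pulling one variable out of every case gives that variable's column.
ages = [case["age"] for case in cases]

print("variables:", variables)
print("cases shown:", len(cases))
print("ages:", ages)
```

Reading across a dictionary gives one case; reading the same key down the list gives one variable.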

Structure of a Data File…

------------------------------------------------------------------------------------
      name         state    pop2000   pop2010   fed_spend   poverty   homeownership 
---------------- --------- --------- --------- ----------- --------- ---------------
 Autauga County   Alabama    43671     54571      6.068      10.6         77.5      

 Baldwin County   Alabama   140415    182265      6.14       12.2         76.7      

 Barbour County   Alabama    29038     27457      8.752       25           68       

  Bibb County     Alabama    20826     22915      7.122      12.6         82.9      

 Blount County    Alabama    51024     57322      5.131      13.4          82       

 Bullock County   Alabama    11714     10914      9.973      25.3         76.9      

 Butler County    Alabama    21399     20947      9.312       25           69       

 Calhoun County   Alabama   112249    118572      15.44      19.5         70.7      
------------------------------------------------------------------------------------
Here, the county is the unit of observation.

Descriptive Statistics

Goal: to summarize information contained in a variable or multiple variables

* Graphical Description
* Numerical Summaries

  • Univariate Analysis
    • descriptions and summaries of a single variable, e.g., income
  • Bivariate Analysis
    • analysis of relation between two variables, e.g., income and level of education
  • Multivariate Analysis
    • analysis of relation among more than two variables, e.g., income, level of education, and gender

Types of Variables

1. Numerical or Quantitative

- continuous: the variable can take any value in an interval of real numbers, e.g., weight, height, age, etc.

- discrete: the variable takes values from a finite or countable set, typically counts, e.g., population, traffic volume, etc.


2. Categorical or Qualitative

- nominal (unordered): the data fall into categories that have no particular order or ranking in relation to each other, e.g., color, gender, nationality, etc.

- ordinal (ordered): the categories have a natural order or ranking, e.g., exam performance (letter grades), satisfaction level, temperature recorded as low/medium/high, etc.

Displaying Categorical Variables

Bar Chart

  • Area principle: the area occupied by a part of the graph should correspond to the magnitude of the value it represents.
  • The categories of a nominal variable have no particular order or ranking, so their order along the x-axis is arbitrary.

Grouped Bar Chart

Stacked Bar Chart

Stacked Percentage Bar Chart

Bar Chart: Hair vs. Eye Color of Males

Summarizing Categorical Variables

Contingency Table

Distribution of drivers by two categorical variables - age and sex

age male_dvr fem_dvr rowtot
<20 4779 4861 9640
20-29 8690 8841 17531
30-39 8849 9303 18152
40-49 9231 10256 19487
50-59 8281 9539 17820
60+ 6100 7170 13270


Marginal and Conditional Distributions

age     male_dvr   fem_dvr   rowtot   male.cond   fem.cond   tot.margin   male.rowp   fem.rowp
<20     4779       4861      9640     0.10        0.10       0.10         0.50        0.50
20-29   8690       8841      17531    0.19        0.18       0.18         0.50        0.50
30-39   8849       9303      18152    0.19        0.19       0.19         0.49        0.51
40-49   9231       10256     19487    0.20        0.21       0.20         0.47        0.53
50-59   8281       9539      17820    0.18        0.19       0.19         0.46        0.54
60+     6100       7170      13270    0.13        0.14       0.14         0.46        0.54
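
The proportion columns in this table can be reproduced from the raw counts. The sketch below is a minimal plain-Python version, assuming the proportions are rounded to two decimals. The variable names mirror the column headers but are otherwise illustrative.

```python
# Driver counts by age group, copied from the contingency table.
age_groups = ["<20", "20-29", "30-39", "40-49", "50-59", "60+"]
male = [4779, 8690, 8849, 9231, 8281, 6100]
female = [4861, 8841, 9303, 10256, 9539, 7170]

# Totals needed for the margins.
row_totals = [m + f for m, f in zip(male, female)]
male_total = sum(male)
female_total = sum(female)
grand_total = male_total + female_total

# Conditional distribution of age within each sex (column proportions).
male_cond = [round(m / male_total, 2) for m in male]
fem_cond = [round(f / female_total, 2) for f in female]

# Marginal distribution of age (row total over grand total).
tot_margin = [round(t / grand_total, 2) for t in row_totals]

# Conditional distribution of sex within each age group (row proportions).
male_rowp = [round(m / t, 2) for m, t in zip(male, row_totals)]
fem_rowp = [round(f / t, 2) for f, t in zip(female, row_totals)]

for i, group in enumerate(age_groups):
    print(group, male_cond[i], fem_cond[i], tot_margin[i], male_rowp[i], fem_rowp[i])
```

Column proportions divide by a column total, row proportions divide by a row total, and the marginal distribution divides the row totals by the grand total.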

Independence

  • Independence: two categorical variables are independent when the conditional distribution of one is the same for every category of the other.

  • Dependence: if the conditional distributions differ across categories, the variables are dependent, i.e., there is an association between them.

age male_dvr fem_dvr male.rowp fem.rowp
<20 4779 4861 0.50 0.50
20-29 8690 8841 0.50 0.50
30-39 8849 9303 0.49 0.51
40-49 9231 10256 0.47 0.53
50-59 8281 9539 0.46 0.54
60+ 6100 7170 0.46 0.54


  • Notice that as age increases, the percentage of female drivers increases, from 50% among drivers under 20 to 54% among drivers 60 and older. This association suggests that the age and sex of drivers are dependent.
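
A rough numerical check of this observation is sketched below. It assumes the comparison of interest is each age group's share of female drivers against the overall share. This is a descriptive comparison only, not a formal test of independence, and the names `fem_share` and `overall_fem_share` are illustrative.

```python
# Driver counts by age group, copied from the table above.
age_groups = ["<20", "20-29", "30-39", "40-49", "50-59", "60+"]
male = [4779, 8690, 8849, 9231, 8281, 6100]
female = [4861, 8841, 9303, 10256, 9539, 7170]

# Share of female drivers within each age group (row proportion).
fem_share = [f / (m + f) for m, f in zip(male, female)]

# Share of female drivers among all drivers.
overall_fem_share = sum(female) / (sum(male) + sum(female))

# If age and sex were independent, every group's share would be close to
# the overall share. Here the share rises steadily with age instead.
non_decreasing = all(a <= b for a, b in zip(fem_share, fem_share[1:]))

for group, share in zip(age_groups, fem_share):
    print(f"{group}: {share:.3f} (overall {overall_fem_share:.3f})")
print("share never decreases with age:", non_decreasing)
```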

Contingency Table

Age vs. Blood Pressure (BP)

Table - Number of People Cross-Tabulated by Age and BP

-------------------------------------------------
   BP     Age_Under_30   Age_30_49   Age_Over_50 
-------- -------------- ----------- -------------
  Low          27           37           23      

 Normal        48           91           51      

  High         23           51           73      
-------------------------------------------------

Find:

  1. Marginal distribution of blood pressure level
  2. Conditional distribution of blood pressure level within each age group
  3. Association between age and blood pressure
  4. Compare these distributions with a segmented bar graph

Solutions

  1. Marginal distribution of blood pressure level
    BP       Age_Under_30   Age_30_49   Age_Over_50   total   marginal
    Low      27             37          23            87      0.21
    Normal   48             91          51            190     0.45
    High     23             51          73            147     0.35


  2. Conditional distribution of blood pressure level within each age group
    BP       Age_Under_30   Age_30_49   Age_Over_50   total   marginal   con_Under_30   con_30_49   con_Over_50
    Low      27             37          23            87      0.21       0.28           0.21        0.16
    Normal   48             91          51            190     0.45       0.49           0.51        0.35
    High     23             51          73            147     0.35       0.23           0.28        0.50


  3. As age increases, the percentage of adults with high blood pressure increases (23%, 28%, 50%), while the percentage with low blood pressure decreases (28%, 21%, 16%). Age and blood pressure are therefore associated.


Solutions…

  4. Compare these distributions with a segmented bar graph
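
The marginal and conditional distributions in solutions 1 and 2 can be reproduced from the cross-tabulated counts. The sketch below is a minimal plain-Python version, assuming proportions rounded to two decimals. The dictionary layout and the names `counts`, `marginal`, and `conditional` are illustrative choices.

```python
# Counts from the Age vs. Blood Pressure table, one list per age group,
# ordered Low, Normal, High.
bp_levels = ["Low", "Normal", "High"]
counts = {
    "Age_Under_30": [27, 48, 23],
    "Age_30_49": [37, 91, 51],
    "Age_Over_50": [23, 51, 73],
}

# Row totals: number of people at each blood pressure level.
row_totals = [sum(col[i] for col in counts.values()) for i in range(len(bp_levels))]
grand_total = sum(row_totals)

# Marginal distribution of blood pressure (ignores age).
marginal = [round(t / grand_total, 2) for t in row_totals]

# Conditional distribution of blood pressure within each age group:
# each count divided by its own age group's total.
conditional = {
    age: [round(c / sum(col), 2) for c in col]
    for age, col in counts.items()
}

print("marginal:", dict(zip(bp_levels, marginal)))
for age, dist in conditional.items():
    print(age, dict(zip(bp_levels, dist)))
```

Stacking each age group's conditional proportions into a single bar of height 1 gives the segmented bar graph asked for in item 4.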


Practice

program   Males Accepted (of Applicants)   Females Accepted (of Applicants)
1         511 of 825                       89 of 108
2         353 of 560                       17 of 25
3         137 of 407                       132 of 375
4         22 of 373                        24 of 341
Total     1022 of 2165                     262 of 849


Find:

  1. Marginal distribution of acceptance (accepted vs. not accepted)
  2. Conditional distribution of acceptance within each sex
  3. Association between sex and acceptance
  4. Compare these distributions with a segmented bar graph


Next Week


Chapter 3: Displaying Quantitative Variables
Chapter 4: Understanding and Comparing Distributions