Elementary Statistics

Math 404 SMPU

Sir Calvin A. Gaye

2024-10-14

1 Introduction to Statistics

1.1 An Overview of Statistics

What You Should Learn

A definition of statistics
How to distinguish between a population and a sample and between a parameter and a statistic
How to distinguish between descriptive statistics and inferential statistics

A Definition of Statistics

Data consist of information coming from observations, counts, measurements, or responses.
Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

Data Sets

There are two types of data sets you will use when studying statistics. These data sets are called populations and samples.

A population is the collection of all outcomes, responses, measurements, or counts that are of interest.
A sample is a subset, or part, of a population.

Example 1: Identifying Data Sets

In a survey, 834 employees in the United States were asked whether they thought their jobs were highly stressful. Of the 834 respondents, 517 said yes. Identify the population and the sample. Describe the sample data set.
Example 1 | Solution
- The population consists of the responses of all employees in the United States.
- The sample consists of the responses of the 834 employees in the survey.
- The sample data set consists of 517 people who said yes and 317 who said no.

Try It Yourself 1 |

In a survey of 1501 ninth to twelfth graders in the United States, 1215 said “leaders today are more concerned with their own agenda than with achieving the overall goals of the organization they serve.” Identify the population and the sample. Describe the sample data set.

Solution 1 | Try It Yourself

The population consists of the responses of all ninth to twelfth graders in the United States.
The sample consists of the responses of the 1501 ninth to twelfth graders in the survey.
The sample data set consists of 1215 ninth to twelfth graders who said leaders today are more concerned with their own agenda than with achieving the overall goals of the organization they serve and 286 ninth to twelfth graders who did not say that.

Data Sets | Parameter and Statistic

A parameter is a numerical description of a population characteristic.
A statistic is a numerical description of a sample characteristic.
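
The distinction can be sketched in a few lines of Python. The scores below are made-up illustrative values, not data from any survey, and the names `population_scores` and `sample_scores` are hypothetical.

```python
import random
import statistics

# Hypothetical population: math scores (%) for an entire class of 10 students.
population_scores = [35, 42, 38, 51, 40, 29, 47, 44, 36, 38]

# Parameter: a numerical description computed from the whole population.
population_mean = statistics.mean(population_scores)

# Statistic: the same description computed from a sample (a subset).
random.seed(1)  # fixed seed so the sketch is repeatable
sample_scores = random.sample(population_scores, 4)
sample_mean = statistics.mean(sample_scores)

print("Population mean (parameter):", population_mean)
print("Sample mean (statistic):", sample_mean)
```

The sample mean will usually differ somewhat from the population mean, which is why the two numbers carry different names.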

Example 2 | Distinguishing Between a Parameter and a Statistic

Determine whether each number describes a population parameter or a sample statistic. Explain your reasoning.
- A survey of several hundred collegiate student-athletes in the United States found that, during the season of their sport, the average time spent on athletics by student-athletes is 50 hours per week.
- The freshman class at a university has an average WASSCE math score of 40%.

Example 2 | Solution

Because the average of 50 hours per week is based on a subset of the population, it is a sample statistic.
Because the average WASSCE math score of 40% is based on the entire freshman class, it is a population parameter.

Try It Yourself 2 |

Determine whether each number describes a population parameter or a sample statistic. Explain your reasoning.
- In a random check of several hundred retail stores, the Food and Drug Administration found that 34% of the stores were not storing fish at the proper temperature.
- Last year, a small company spent a total of $5,150,694 on employees’ salaries.

Solution 2 | Try It Yourself

Because 34% is based on a subset of the population, it is a sample statistic.
Population parameter, because the total spent on employees’ salaries, $5,150,694, is based on the entire company.

Branches of Statistics

The study of statistics has two major branches: descriptive statistics and inferential statistics.
- Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data.
- Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability. (You will learn more about probability in Chapter 3.)

Example 3 | Descriptive and Inferential Statistics

For each study, identify the population and the sample. Then determine which part of the study represents the descriptive branch of statistics. What conclusions might be drawn from the study using inferential statistics?
- A study of 300 Wall Street analysts found that the percentage who incorrectly forecasted high-tech earnings in a recent year was 44%.

Example 3 | Solution

The population consists of the high-tech earnings forecasts of all Wall Street analysts, and
the sample consists of the forecasts of the 300 Wall Street analysts in the study.
The part of this study that represents the descriptive branch of statistics involves the statement “the percentage [of Wall Street analysts] who incorrectly forecasted high-tech earnings in a recent year was 44%.”
A possible inference drawn from the study is that the stock market is difficult to forecast, even for professionals.

Try It Yourself 3 |

A study of 1000 U.S. adults found that when they have a question about their medication, three out of four adults will consult with their physician or pharmacist and only 8% visit a medication-specific website.
- Identify the population and the sample.
- Determine which part of the study represents the descriptive branch of statistics.
- What conclusions might be drawn from the study using inferential statistics?

Solution 3 | Try It Yourself

The population consists of the responses of all U.S. adults, and the sample consists of the responses of the 1000 U.S. adults in the study.
The part of this study that represents the descriptive branch of statistics involves the statement “three out of four adults will consult with their physician or pharmacist and only 8% visit a medication-specific website [when they have a question about their medication].”
A possible inference drawn from the study is that most adults consult with their physician or pharmacist when they have a question about their medication.

1.2 Data Classification

What You Should Learn

How to distinguish between qualitative data and quantitative data
How to classify data with respect to the four levels of measurement: nominal, ordinal, interval, and ratio.

Types of Data

Data sets can consist of two types of data: qualitative data and quantitative data.
- Qualitative data consist of attributes, labels, or nonnumerical entries.
- Quantitative data consist of numbers that are measurements or counts.

Example 1 | Classifying Data by Type

The table shows sports-related head injuries treated in U.S. emergency rooms during a recent five-year span for several sports. Which data are qualitative data and which are quantitative data? Explain your reasoning.

Types of Data

code

library(flextable)
library(tidyverse)
library(formattable)
library(gt)
library(DT)
library(reactablefmtr)
sport = c('Basketball', 'Baseball', 'Football', 'Gymnastics', 'Hockey', 'Soccer', 'Softball', 'Swimming', 'Volleyball')

head = c(131930, 83522, 220258, 33265, 41450, 98710, 41216, 44815, 13848)

dataHIT = data.frame(sport, head)
hit = dataHIT |>
  rename("head injuries treated" = head)

knitr::kable(hit, format = "pandoc", caption = "Sports-Related Head Injuries")

Sports-Related Head Injuries
sport	head injuries treated
Basketball	131930
Baseball	83522
Football	220258
Gymnastics	33265
Hockey	41450
Soccer	98710
Softball	41216
Swimming	44815
Volleyball	13848

Example 1 | Solution

The information shown in the table can be separated into two data sets. One data set contains the names of sports, and the other contains the numbers of head injuries treated. The names are nonnumerical entries, so these are qualitative data. The numbers of head injuries treated are numerical entries, so these are quantitative data.
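
The reasoning in this solution can be mimicked with a rough Python check. This is only a heuristic sketch: the helper `classify` is a hypothetical name, and it would mislabel numeric labels (such as phone numbers stored as numbers), so judgment is still needed.

```python
# A few rows from the head-injury table.
sports = ["Basketball", "Baseball", "Football"]
head_injuries = [131930, 83522, 220258]

def classify(entries):
    """Rough heuristic: all-numeric entries -> quantitative, otherwise qualitative."""
    if all(isinstance(entry, (int, float)) for entry in entries):
        return "quantitative"
    return "qualitative"

sport_type = classify(sports)
injury_type = classify(head_injuries)

print("sport:", sport_type)
print("head injuries treated:", injury_type)
```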

Try It Yourself 1 |

The populations of several counties in Liberia are shown in the table. Which data are qualitative data and which are quantitative data? Explain your reasoning. (Source: Liberia LISGIS)

Try It Yourself 1

code

library(flextable)
library(tidyverse)
library(formattable)
library(gt)
library(DT)
library(reactablefmtr)
County = c('Bong', 'Grand Bassa', 'Grand Gedeh', 'Lofa', 'Montserrado', 'Nimba')

Population = c(467561, 293689, 216692, 367376, 1920965, 621841)

dataCounty = data.frame(County, Population)


knitr::kable(dataCounty, format = "pandoc", caption = "Liberia County Populations")

Liberia County Populations
County	Population
Bong	467561
Grand Bassa	293689
Grand Gedeh	216692
Lofa	367376
Montserrado	1920965
Nimba	621841

Solution 1 | Try It Yourself

The information shown in the table can be separated into two data sets. One data set contains the names of counties, and the other contains the numbers of people (populations). The names are nonnumerical entries, so these are qualitative data. The numbers of people are numerical entries, so these are quantitative data.

Levels of Measurement

Another characteristic of data is its level of measurement. The level of measurement determines which statistical calculations are meaningful. The four levels of measurement, in order from lowest to highest, are nominal, ordinal, interval, and ratio.

Nominal and Ordinal

Data at the nominal level of measurement are qualitative only. Data at this level are categorized using names, labels, or qualities. No mathematical computations can be made at this level.
Data at the ordinal level of measurement are qualitative or quantitative. Data at this level can be arranged in order, or ranked, but differences between data entries are not meaningful.

Example 2 | Classifying Data by Level

For each data set, determine whether the data are at the nominal level or at the ordinal level. Explain your reasoning.

code

library(flextable)
library(tidyverse)
library(formattable)
library(gt)
library(DT)
library(reactablefmtr)
head1 = c('1. Personal care aides', '2. Registered nurses', '3. Home health aides', '4. Combined food preparation and serving workers, including fast food', '5. Retail salespersons')


dataH = data.frame(head1)

dataH2 = dataH |>
  rename("Top five U.S. occupations with the most job growth (projected 2024)" = head1)

knitr::kable(dataH2, format = "pandoc", caption = "")

Top five U.S. occupations with the most job growth (projected 2024)
1. Personal care aides
2. Registered nurses
3. Home health aides
4. Combined food preparation and serving workers, including fast food
5. Retail salespersons

Example 2 | Classifying Data by Level

code

library(flextable)
library(tidyverse)
library(formattable)
library(gt)
library(DT)
library(reactablefmtr)
movie = c('Action', 'Adventure', 'Comedy', 'Drama', 'Horror')


movie1 = data.frame(movie)

movie2 = movie1 |>
  rename("Movie genres" = movie)

knitr::kable(movie2, format = "pandoc", caption = "")

Movie genres
Action
Adventure
Comedy
Drama
Horror

Example 2 | Solution

1. This data set lists the ranks of the five fastest-growing occupations in the U.S. over the next few years. The data set consists of the ranks 1, 2, 3, 4, and 5. Because the ranks can be listed in order, these data are at the ordinal level. Note that the difference between a rank of 1 and 5 has no mathematical meaning.
2. This data set consists of the names of movie genres. No mathematical computations can be made with the names, and the names cannot be ranked, so these data are at the nominal level.

Interval and Ratio

The two highest levels of measurement consist of quantitative data only.
- Data at the interval level of measurement can be ordered, and meaningful differences between data entries can be calculated. At the interval level, a zero entry simply represents a position on a scale; the entry is not an inherent zero.
- Data at the ratio level of measurement are similar to data at the interval level, with the added property that a zero entry is an inherent zero. A ratio of two data entries can be formed so that one data entry can be meaningfully expressed as a multiple of another.
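
One way to see the difference between the two levels is to check whether a ratio survives a change of units. The sketch below uses made-up values. Temperature has no inherent zero, so its "ratio" changes with the scale, while a count-based rate keeps the same ratio.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

# Interval level: zero is only a position on the scale.
t1_f, t2_f = 50.0, 100.0
t1_c, t2_c = fahrenheit_to_celsius(t1_f), fahrenheit_to_celsius(t2_f)
ratio_f = t2_f / t1_f  # ratio computed in Fahrenheit
ratio_c = t2_c / t1_c  # the same two temperatures, ratio computed in Celsius

# Ratio level: zero beats per minute is an inherent zero.
rate1_bpm, rate2_bpm = 60.0, 120.0
ratio_bpm = rate2_bpm / rate1_bpm                # beats per minute
ratio_bps = (rate2_bpm / 60) / (rate1_bpm / 60)  # beats per second

print("Temperature ratios:", ratio_f, round(ratio_c, 2))
print("Heart rate ratios:", ratio_bpm, ratio_bps)
```

Because the temperature ratio depends on the unit chosen, saying one temperature is "twice" another has no fixed meaning. The heart rate ratio is the same in either unit.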

Example 3 | Classifying Data by Level

Two data sets are shown at the left. Which data set consists of data at the interval level? Which data set consists of data at the ratio level? Explain your reasoning.
Data set 1

Example 3 | Classifying Data by Level

Data set 2

Example 3 | Solution

Both of these data sets contain quantitative data. Consider the dates of the Yankees’ World Series victories. It makes sense to find differences between specific dates. For instance, the time between the Yankees’ first and last World Series victories is
\[ 2009 - 1923 = 86 \text{ years}. \]

Example 3 | Solution

But it does not make sense to say that one year is a multiple of another. So, these data are at the interval level. However, using the home run totals, you can find differences and write ratios. For instance, Boston hit 22 more home runs than Cleveland hit because 81 - 59 = 22 home runs. Also, Chicago hit about 1.25 times as many home runs as Baltimore hit because
\[ \frac{96}{77} \approx 1.25. \]
So, these data are at the ratio level.

Try It Yourself 2 |

For each data set, determine whether the data are at the nominal level or at the ordinal level. Explain your reasoning.
- The final standings for the Pacific Division of the National Basketball Association
- A collection of phone numbers

Try It Yourself 3 |

For each data set, determine whether the data are at the interval level or at the ratio level. Explain your reasoning.
- The body temperatures (in degrees Fahrenheit) of an athlete during an exercise session
- The heart rates (in beats per minute) of an athlete during an exercise session

Solution 2 | Try It Yourself

Ordinal, because the data can be put in order.
Nominal, because no mathematical computations can be made.

Solution 3 | Try It Yourself

Interval, because the data can be ordered and meaningful differences can be calculated, but it does not make sense to write a ratio using the temperatures.
Ratio, because the data can be ordered, meaningful differences can be calculated, the data can be written as a ratio, and the data set contains an inherent zero.

1.3 Data Collection and Experimental Design

What You Should Learn

How to design a statistical study and how to distinguish between an observational study and an experiment
How to collect data by using a survey or a simulation
How to design an experiment
How to create a sample using random sampling, simple random sampling, stratified sampling, cluster sampling, and systematic sampling and how to identify a biased sample

Design of a Statistical Study | GUIDELINES

Identify the variable(s) of interest (the focus) and the population of the study.
Develop a detailed plan for collecting data. If you use a sample, make sure the sample is representative of the population.
Collect the data.
Describe the data, using descriptive statistics techniques.
Interpret the data and make decisions about the population using inferential statistics.
Identify any possible errors.

Design of a Statistical Study

A statistical study can usually be categorized as an observational study or an experiment.
In an observational study, a researcher does not influence the responses.
In an experiment, a researcher deliberately applies a treatment before observing the responses. Here is a brief summary of these types of studies.

Observational Study

In an observational study, a researcher observes and measures characteristics of interest of part of a population but does not change existing conditions.
For instance, an observational study was conducted in which researchers measured the amount of time people spent doing various activities, such as volunteering, paid work, childcare, and socializing.

Experiment

In performing an experiment, a treatment is applied to part of a population, called a treatment group, and responses are observed.
Another part of the population may be used as a control group, in which no treatment is applied. (The subjects in both groups are called experimental units.)
In many cases, subjects in the control group are given a placebo, which is a harmless, fake treatment that is made to look like the real treatment.
The responses of both groups can then be compared and studied.

Example 1 a. | Distinguishing Between an Observational Study and an Experiment

Determine whether each study is an observational study or an experiment.
- Researchers study the effect of vitamin D3 supplementation among patients who were newly diagnosed with a viral infection. To perform the study, researchers give 2700 U.S. adults either a daily vitamin D3 supplement or a placebo for four weeks.

Example 1 b. | Distinguishing Between an Observational Study and an Experiment

Determine whether each study is an observational study or an experiment.
- Researchers conduct a study to determine how confident Americans are in the U.S. economy. To perform the study, researchers call 1019 U.S. adults and ask them to rate current U.S. economic conditions and whether the U.S. economy is getting better or worse.

Example 1 | Solution

a. Because the study applies a treatment (vitamin D3) to the subjects, the study is an experiment.
b. Because the study does not attempt to influence the responses of the subjects (there is no treatment), the study is an observational study.

Try It Yourself 1 |

The Pennsylvania Game Commission conducted a study to determine the percentage of the Pennsylvania elk population in each age and sex class. The commission captured and released elk during each year of the study and found an overall average of 16% branched bulls, 7% spike bulls, 56% adult cows, and 21% calves. Is this study an observational study or an experiment?

Solution 1 | Try It Yourself

This is an observational study.

Data Collection

There are several ways to collect data. Often, the focus of the study dictates the best way to collect data. Here is a brief summary of two methods of data collection.

Simulation

A simulation is the use of a mathematical or physical model to reproduce the conditions of a situation or process. Collecting data often involves the use of computers. Simulations allow you to study situations that are impractical or even dangerous to create in real life, and often they save time and money. For instance, automobile manufacturers use simulations with dummies to study the effects of crashes on humans. Throughout this course, you will have the opportunity to use applets that simulate statistical processes on a computer.

Survey

A survey is an investigation of one or more characteristics of a population. Most often, surveys are carried out on people by asking them questions. The most common types of surveys are done by interview, Internet, phone, or mail. In designing a survey, it is important to word the questions so that they do not lead to biased results, which are not representative of a population. For instance, a survey is conducted on a sample of physicians to determine whether the primary reason for their career choice is financial stability. In designing the survey, it would be acceptable to make a list of reasons and ask each individual in the sample to select their first choice.

Experimental Design

Three key elements of a well-designed experiment are control, randomization, and replication.

Confounding variable

Because experimental results can be ruined by a variety of factors, being able to control these influential factors is important. One such factor is a confounding variable.
A confounding variable occurs when an experimenter cannot tell the difference between the effects of different factors on the variable.

Placebo Effect

The placebo effect occurs when a subject reacts favorably to a placebo when in fact the subject has been given a fake treatment. To help control or minimize the placebo effect, a technique called blinding can be used.
Blinding is a technique in which the subjects do not know whether they are receiving a treatment or a placebo. In a double-blind experiment, neither the experimenter nor the subjects know whether the subjects are receiving a treatment or a placebo. The experimenter is informed after all the data have been collected. This type of experimental design is preferred by researchers.

Randomization

One challenge for experimenters is assigning subjects to groups so the groups have similar characteristics (such as age, height, weight, and so on). When treatment and control groups are similar, experimenters can conclude that any differences between groups are due to the treatment. To form groups with similar characteristics, experimenters use randomization.
Randomization is a process of randomly assigning subjects to different treatment groups.

Randomization

In a completely randomized design, subjects are assigned to different treatment groups through random selection. In some experiments, it may be necessary for the experimenter to use blocks, which are groups of subjects with similar characteristics. A commonly used experimental design is a randomized block design. To use a randomized block design, the experimenter divides the subjects with similar characteristics into blocks and then, within each block, randomly assigns subjects to treatment groups.
Another type of experimental design is a matched-pairs design, in which subjects are paired up according to a similarity.
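
A randomized block design can be sketched in Python as follows. The subject IDs, the blocking characteristic, and the half-and-half split are all illustrative assumptions, not part of any real study.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical subjects, each tagged with the blocking characteristic.
subjects = [(f"S{i:02d}", "female" if i % 2 == 0 else "male") for i in range(1, 13)]

# Step 1: divide the subjects into blocks with similar characteristics.
blocks = {}
for subject_id, gender in subjects:
    blocks.setdefault(gender, []).append(subject_id)

# Step 2: within each block, randomly assign half to treatment and half to control.
assignment = {}
for gender, members in blocks.items():
    shuffled = random.sample(members, len(members))
    half = len(shuffled) // 2
    for subject_id in shuffled[:half]:
        assignment[subject_id] = "treatment"
    for subject_id in shuffled[half:]:
        assignment[subject_id] = "control"

# Count how many subjects in each block received the treatment.
treated_per_block = {
    gender: sum(assignment[s] == "treatment" for s in members)
    for gender, members in blocks.items()
}
print(treated_per_block)
```

Because the random assignment happens inside each block, both the treatment group and the control group contain members of every block.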

Randomization

Sample size, which is the number of subjects in a study, is another important part of experimental design. To improve the validity of experimental results, replication is required.
Replication is the repetition of an experiment under the same or similar conditions.

Example 2 a. | Analyzing an Experimental Design

A company wants to test the effectiveness of a new gum developed to help people quit smoking. Identify a potential problem with each experimental design and suggest a way to improve it.
- The company identifies ten adults who are heavy smokers. Five of the subjects are given the new gum and the other five subjects are given a placebo. After two months, the subjects are evaluated and it is found that the five subjects using the new gum have quit smoking.

Example 2 b. | Analyzing an Experimental Design

A company wants to test the effectiveness of a new gum developed to help people quit smoking. Identify a potential problem with each experimental design and suggest a way to improve it.
- The company identifies 1000 adults who are heavy smokers. The subjects are divided into blocks according to gender. Females are given the new gum and males are given the placebo. After two months, a significant number of the female subjects have quit smoking.

Example 2 | Solution

a. The sample size being used is not large enough to validate the results of the experiment. The experiment must be replicated to improve the validity.
b. The groups are not similar. The new gum may have a greater effect on women than on men, or vice versa. The subjects can be divided into blocks according to gender, but then, within each block, they should be randomly assigned to be in the treatment group or in the control group.

Sampling Techniques

A census is a count or measure of an entire population. Taking a census provides complete information, but it is often costly and difficult to perform.
A sampling is a count or measure of part of a population and is more commonly used in statistical studies.
To collect unbiased data, a researcher must ensure that the sample is representative of the population.

Sampling Techniques

Even with the best methods of sampling, a sampling error may occur.
A sampling error is the difference between the results of a sample and those of the population.
A random sample is one in which every member of the population has an equal chance of being selected.
A simple random sample is a sample in which every possible sample of the same size has the same chance of being selected.

Sampling Techniques

Stratified Sample: When it is important for the sample to have members from each segment of the population, you should use a stratified sample.
Depending on the focus of the study, members of the population are divided into two or more subsets, called strata, that share a similar characteristic such as age, gender, ethnicity, or even political preference.
A sample is then randomly selected from each of the strata. Using a stratified sample ensures that each segment of the population is represented.
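
A stratified sample might be drawn as in the sketch below. The student IDs, the three majors, and the choice of two students per stratum are hypothetical.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical population: (student_id, major) pairs.
majors = ["Biology", "Economics", "Mathematics"]
students = [(i, majors[i % 3]) for i in range(1, 31)]

# Divide the population into strata that share a characteristic (major).
strata = {}
for student_id, major in students:
    strata.setdefault(major, []).append(student_id)

# Randomly select members from each of the strata.
per_stratum = 2
stratified_sample = {
    major: random.sample(members, per_stratum)
    for major, members in strata.items()
}
print(stratified_sample)
```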

Sampling Techniques

Cluster Sample: When the population falls into naturally occurring subgroups, each having similar characteristics, a cluster sample may be the most appropriate.
To select a cluster sample, divide the population into groups, called clusters, and select all of the members in one or more (but not all) of the clusters.
Examples of clusters could be different sections of the same course or different branches of a bank.
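
For comparison with stratified sampling, here is a minimal cluster sampling sketch. The section names and member labels are made up for illustration.

```python
import random

random.seed(5)  # fixed seed so the sketch is repeatable

# Hypothetical clusters: different sections of the same course.
clusters = {
    "Section A": ["A1", "A2", "A3", "A4"],
    "Section B": ["B1", "B2", "B3"],
    "Section C": ["C1", "C2", "C3", "C4", "C5"],
    "Section D": ["D1", "D2", "D3"],
}

# Randomly choose some, but not all, of the clusters.
chosen_clusters = random.sample(sorted(clusters), 2)

# Select every member of each chosen cluster.
cluster_sample = [member for name in chosen_clusters for member in clusters[name]]
print(chosen_clusters, cluster_sample)
```

Note the contrast: a stratified sample takes some members from every group, while a cluster sample takes all members from some groups.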

Sampling Techniques

Systematic Sample: A systematic sample is a sample in which each member of the population is assigned a number.
The members of the population are ordered in some way, a starting number is randomly selected, and then sample members are selected at regular intervals from the starting number. (For instance, every 3rd, 5th, or 100th member is selected.)
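
The selection rule can be sketched as follows, assuming a hypothetical population of 100 numbered members and an interval of 5.

```python
import random

random.seed(11)  # fixed seed so the sketch is repeatable

population_size = 100  # members are numbered 1 to 100 (assumed)
interval = 5           # select every 5th member (assumed)

# Randomly select a starting number within the first interval.
start = random.randint(1, interval)

# Select members at regular intervals from the starting number.
systematic_sample = list(range(start, population_size + 1, interval))
print(systematic_sample)
```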

Sampling Techniques

A type of sample that often leads to biased studies (so it is not recommended) is a convenience sample.
A convenience sample consists only of members of the population that are easy to access.

Example 3 | Identifying Sampling Techniques

You are doing a study to determine the opinions of students at your school regarding stem cell research. Identify the sampling technique you are using when you select the samples listed. Discuss potential sources of bias (if any).
- You divide the student population with respect to majors and randomly select and question some students in each major.
- You assign each student a number and generate random numbers. You then question each student whose number is randomly selected.
- You select students who are in your biology class.

Example 3 | Solution

Because students are divided into strata (majors) and a sample is selected from each major, this is a stratified sample.
Each sample of the same size has an equal chance of being selected and each student has an equal chance of being selected, so this is a simple random sample.
Because the sample is taken from students who are readily available, this is a convenience sample. The sample may be biased because biology students may be more familiar with stem cell research than other students and may have stronger opinions.

Try It Yourself 2 |

You want to determine the opinions of students regarding stem cell research. Identify the sampling technique you are using when you select these samples.
- You select a class at random and question each student in the class.
- You assign each student a number and, after choosing a starting number, question every 25th student.

Solution 2 | Try It Yourself

The sample was selected by using the students in a randomly chosen class. This is cluster sampling.
The sample was selected by numbering each student in the school, randomly choosing a starting number, and selecting students at regular intervals from the starting number. This is systematic sampling.

Using Technology in Statistics |

With large data sets, you will find that calculators or computer software programs can help perform calculations and create graphics. We will perform these calculations using RStudio and Python.

Example 1 | Generating a List of Random Numbers

A quality control department inspects a random sample of 15 of the 167 cars that are assembled at an auto plant. How should the cars be chosen?

Generating a List of Random Numbers | Solution Using RStudio

code

library(tidyverse)
library(tictoc)
library(reticulate)
#install_python()
#install_miniconda()
#conda_list()
#conda_create(
#  envname = "reptilia",
#  packages = c("pandas", "pyarrow")
#)
use_miniconda("reptilia")

# To generate integers WITHOUT replacement:
sample(1:167, 15, replace=FALSE)

   [1]  28 122  80 130  66 131 128  20 134  21   1  44 140 113   8

Generating a List of Random Numbers | Solution Using Python

code

import sys
import numpy as np
#print(sys.version)
#print(sys.executable)
import pandas as pd

# (A list of 15 distinct random numbers between 1 and 167)
import random
# random.sample draws WITHOUT replacement, so no car is chosen twice
# (random.randint in a loop can repeat a number)
random_numbers = random.sample(range(1, 168), 15)
print(random_numbers)

# The printed list varies from run to run.

2 Descriptive Statistics

2.1 Frequency Distributions and Their Graphs

What You Should Learn

How to construct a frequency distribution, including limits, midpoints, relative frequencies, cumulative frequencies, and boundaries, and how to do so using technology (R and Python)
How to construct frequency histograms, frequency polygons, relative frequency histograms, and ogives, and how to do so using technology (R and Python)

Frequency Distributions

Important characteristics to look for when organizing and describing a data set are its center, its variability (or spread), and its shape.
In this section, you will learn how to organize data sets by grouping the data into intervals called classes and forming a frequency distribution.

Classes, Frequency

A frequency distribution is a table that shows classes or intervals of data entries with a count of the number of entries in each class.
The frequency f of a class is the number of data entries in the class.

Limits, Width, Range

Lower class limit is the least number that can belong to the class.
Upper class limit is the greatest number that can belong to the class.
The class width is the distance between lower (or upper) limits of consecutive classes.
The difference between the maximum and minimum data entries is called the range.

Utility Functions | R

summary() - main function for collecting simple descriptive statistics.
str() or glimpse() - provide a synthetic representation of information on a data frame, like its size, column names, types, and values of the first elements.
head() or tail() - allow visualizing the few topmost (head) or bottom most (tail) rows of a command output.
View() or view() - visualize a data frame.
unique() - It returns the list of unique values in a series. Particularly useful when applied to columns as unique(df$col_name).

Utility Functions | Python

.describe() - main function for obtaining descriptive statistics.
.info() - provides particularly useful information such as the size of the data frame and, for each column, its name, type, and number of non-null values.
.head() or .tail() - allow visualizing the few topmost (head) or bottommost (tail) rows of a command output.
.unique() - returns a list of unique values in a series. Particularly useful when applied to columns as df['col_name'].unique()

Utility Functions | R

names() - returns column names of a data frame with names(df) and variable names with lists.
class() - returns the data type of an R object, like numeric, character, logical, and data frame.
length() - returns the length of an R object.
nrow() or ncol() - return, respectively, the number of rows and columns in a data frame.

Utility Functions | Python

.columns and .index - return, respectively, the list of column names and the list of names of the row index.
.dtypes - returns the list of columns with the corresponding data type.
.size - returns the number of elements in a pandas object (rows times columns for a data frame).
.shape - returns the number of rows and columns of a data frame.

Example 1 | Constructing a Frequency Distribution from a Data Set

The data set lists the cell phone screen times (in minutes) for 30 U.S. adults on a recent day. Construct a frequency distribution that has seven classes.

\[\begin{align} \begin{matrix} 200 & 239 & 155 & 252 & 384 & 165 & 296 & 405 & 303 & 400\\ 307 & 241 & 256 & 315 & 330 & 317 & 352 & 266 & 276 & 345\\ 238 & 306 & 290 & 271 & 345 & 312 & 293 & 195 & 168 & 342 \end{matrix} \end{align}\]
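
Before turning to the software solutions, the class limits can be worked out directly from the definitions of range and class width. The sketch below uses only the Python standard library. Rounding the class width up to the next whole number is an assumed convention, chosen because it agrees with the width of 36 used in the solutions.

```python
import math

screen_times = [200, 239, 155, 252, 384, 165, 296, 405, 303, 400,
                307, 241, 256, 315, 330, 317, 352, 266, 276, 345,
                238, 306, 290, 271, 345, 312, 293, 195, 168, 342]
num_classes = 7

# Range: difference between the maximum and minimum data entries.
data_range = max(screen_times) - min(screen_times)

# Class width: range divided by the number of classes, rounded up (assumed convention).
class_width = math.ceil(data_range / num_classes)

# Lower limits start at the minimum entry; upper limits sit one below the next lower limit.
lower_limits = [min(screen_times) + i * class_width for i in range(num_classes)]
upper_limits = [low + class_width - 1 for low in lower_limits]

# Frequency: number of data entries that fall in each class.
frequencies = [sum(low <= x <= high for x in screen_times)
               for low, high in zip(lower_limits, upper_limits)]

for low, high, f in zip(lower_limits, upper_limits, frequencies):
    print(f"{low}-{high}: {f}")
```

The class boundaries 154.5 and 406.5 used in the solutions lie half a unit below the first lower limit (155) and half a unit above the last upper limit (406).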

Example 2 | Solution R

R Display

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")
sData = c(200, 239, 155, 252, 384, 165, 296, 405, 303, 400, 307, 241, 256, 315, 330, 317, 352, 266, 276, 345, 238, 306, 290, 271, 345, 312, 293, 195, 168, 342)


sData_frm = data.frame(sData)
#View(sData_frm)

# class boundaries and class width
low_lim = 154.5
up_lim = 406.5
data_range = up_lim - low_lim
n_classes = 7  # named n_classes to avoid reusing the name of the base function class()
class_width = data_range/n_classes
x_breaks = seq(low_lim, up_lim, class_width)


# class midpoints
x_mid <- seq(low_lim + class_width/2,
             up_lim - class_width/2, class_width)
x_midpoint = round(x_mid, digits = 1)
#view(x_midpoint) # look at those values

# Find the class interval into which each value of the original data falls.
# cut() prints interval labels with 3 significant digits by default,
# so the boundary 154.5 is displayed as 154.
c_bound = cut(sData, breaks = x_breaks)
y = table(c_bound)
df = data.frame(y)
df$midpnt = x_mid
rf = (df$Freq)/sum(df$Freq)  # relative frequency
df$rf = round(rf, 2)
df$percent = round(rf * 100, 2)
cs = cumsum(df$Freq)  # cumulative frequency
df$cumul = cs
n = length(sData)  # number of data entries
rcf = round(cumsum(rf), 2)
df$rel_cumul = rcf  # append relative cumulative frequency
df$pie = round(360*rf, 1) # append degrees in pie chart

knitr::kable(df, format = "pandoc", caption = "Table 1-1: Constructing a Frequency Distribution from a Data Set")

Table 1-1: Constructing a Frequency Distribution from a Data Set
c_bound	Freq	midpnt	rf	percent	cumul	rel_cumul	pie
(154,190]	3	172.5	0.10	10.00	3	0.10	36
(190,226]	2	208.5	0.07	6.67	5	0.17	24
(226,262]	5	244.5	0.17	16.67	10	0.33	60
(262,298]	6	280.5	0.20	20.00	16	0.53	72
(298,334]	7	316.5	0.23	23.33	23	0.77	84
(334,370]	4	352.5	0.13	13.33	27	0.90	48
(370,406]	3	388.5	0.10	10.00	30	1.00	36

Example 2 | Solution Python

Python Display

code

import numpy as np
import pandas as pd
from great_tables import GT, md

#sDatap = pd.read_csv("psfords/data/state.csv")

# define your series and bins
bins = np.linspace(154.5, 406.5, 8) 

data = pd.Series([200, 239, 155, 252, 384, 165, 296, 405, 303, 400, 307, 241, 256, 315, 330, 317, 352, 266, 276, 345, 238, 306, 290, 271, 345, 312, 293, 195, 168, 342])



# use np.histogram method
hist, bin_edges = np.histogram(data, bins=bins)

# create a data frame to better display the results
result_df = pd.DataFrame({'bin_lower': bin_edges[:-1],
                          'bin_upper': bin_edges[1:],
                          'frequency': hist})

# computing percentage
result_df['percent (%)'] = round( 100*(result_df.frequency/result_df.frequency.sum()),2)

# computing relative frequency

result_df['rel_frequency'] = round( (result_df.frequency/result_df.frequency.sum()),2)

# cumulative frequency
result_df['cum_frequency'] = result_df.frequency.cumsum()

# cumulative percent
result_df['cum_percent (%)'] = round( 100*(result_df.frequency.cumsum()/result_df.frequency.sum()), 2)

#computing pie
result_df['pie chart angle'] = round( 360*(result_df.frequency/result_df.frequency.sum()),1)


# display result
my_table = (
   GT(result_df)
   .tab_header(
      title="Table 1-1 Constructing a Frequency Distribution from a Data Set",
      subtitle="",
   )
   .tab_source_note(md("**Source code**: the `great_tables` python library")))
my_table

Table 1-1 Constructing a Frequency Distribution from a Data Set

bin_lower	bin_upper	frequency	percent (%)	rel_frequency	cum_frequency	cum_percent (%)	pie chart angle
154.5	190.5	3	10.0	0.1	3	10.0	36.0
190.5	226.5	2	6.67	0.07	5	16.67	24.0
226.5	262.5	5	16.67	0.17	10	33.33	60.0
262.5	298.5	6	20.0	0.2	16	53.33	72.0
298.5	334.5	7	23.33	0.23	23	76.67	84.0
334.5	370.5	4	13.33	0.13	27	90.0	48.0
370.5	406.5	3	10.0	0.1	30	100.0	36.0
Source code: the `great_tables` python library
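
As a cross-check, the same class frequencies can be obtained with `pd.cut`, which assigns each entry to a right-closed interval much like R's `cut()`. The sketch below is an alternative to the `np.histogram` approach, not a replacement for it. It also adds the class midpoints, and the names `edges`, `classes`, and `table` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([200, 239, 155, 252, 384, 165, 296, 405, 303, 400,
                  307, 241, 256, 315, 330, 317, 352, 266, 276, 345,
                  238, 306, 290, 271, 345, 312, 293, 195, 168, 342])

# Eight boundaries define seven classes of width 36
edges = np.linspace(154.5, 406.5, 8)

# Assign each entry to its class interval, then count entries per class in class order
classes = pd.cut(data, bins=edges)
freq = classes.value_counts().sort_index()

table = pd.DataFrame({
    "midpoint": [interval.mid for interval in freq.index],
    "frequency": freq.to_numpy(),
})
table["rel_frequency"] = table["frequency"] / table["frequency"].sum()
table["cum_frequency"] = table["frequency"].cumsum()

print(table.round(2))
```

Because every boundary ends in .5 and every data entry is a whole number, no entry can fall exactly on a boundary, so the choice between left-closed and right-closed intervals does not change the counts.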

Graphs of Frequency Distributions | Frequency Histogram

A frequency histogram uses bars to represent the frequency distribution of a data set. A histogram has the following properties.
- The horizontal scale is quantitative and measures the data entries.
- The vertical scale measures the frequencies of the classes.
- Consecutive bars must touch.
- Because consecutive bars of a histogram must touch, bars must begin and end at class boundaries instead of class limits.
- Class boundaries are the numbers that separate classes without forming gaps between them.
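
To make the difference between class limits and class boundaries concrete, the sketch below derives the boundaries for the screen-time classes. It assumes whole-number class limits with a class width of 36 and a first lower limit of 155, and the variable names are illustrative. Each boundary lies halfway between the upper limit of one class and the lower limit of the next.

```python
# Class limits for the seven screen-time classes (assumed: width 36, first lower limit 155)
lower_limits = [155 + 36 * i for i in range(7)]
upper_limits = [low + 35 for low in lower_limits]

# Gap between consecutive classes when limits are used (190 to 191, and so on)
gap = lower_limits[1] - upper_limits[0]

# Interior boundaries sit halfway between an upper limit and the next lower limit
inner = [(up + low) / 2 for up, low in zip(upper_limits[:-1], lower_limits[1:])]
boundaries = [lower_limits[0] - gap / 2] + inner + [upper_limits[-1] + gap / 2]

print(boundaries)
```

With boundaries, the upper end of each class coincides with the lower end of the next, which is why the bars of a histogram can touch.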

Example 3 | Constructing a Frequency Histogram

Draw a frequency histogram for the frequency distribution in Example 2. Describe any patterns.

Example 3 | Solution R

Histogram

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")
sData = c(200, 239, 155, 252, 384, 165, 296, 405, 303, 400, 307, 241, 256, 315, 330, 317, 352, 266, 276, 345, 238, 306, 290, 271, 345, 312, 293, 195, 168, 342)


sData_frm = data.frame(sData)
# ggplot2: bars begin and end at the class boundaries 154.5, 190.5, ..., 406.5
ggplot(sData_frm, aes(sData)) +
geom_histogram(aes(fill = after_stat(count)), breaks = seq(154.5, 406.5, 36)) +
scale_x_continuous(name = "Screen time (minutes)") +
scale_y_continuous(name = "Frequency") +
labs(title = "Frequency histogram of cell phone screen times")

Example 3 | Solution Python

Histogram

code

import pandas as pd
import altair as alt

data2 = pd.Series([200, 239, 155, 252, 384, 165, 296, 405, 303, 400, 307, 241, 256, 315, 330, 317, 352, 266, 276, 345, 238, 306, 290, 271, 345, 312, 293, 195, 168, 342])
# Create the pandas DataFrame
df2 = pd.DataFrame({'screen_time': data2})

# plot histogram
# note: maxbins lets Altair choose "nice" bin edges,
# which need not coincide with the class boundaries of Example 2

phone_hist = alt.Chart(df2).mark_bar().encode(
    x=alt.X("screen_time", bin=alt.Bin(maxbins=8), title="Screen time (minutes)"),
    y=alt.Y("count()", title="Frequency")
)

phone_hist

The Basics | R

After working with the material in this chapter, you will be able to:

Create reusable R scripts,
Store data in R,
Use functions in R to analyse data,
Install add-on packages adding more features to R,
Compute descriptive statistics like the mean and the median, including for subgroups,
Do mathematical calculations,
Create nice-looking plots, including scatterplots, boxplots, histograms and bar charts,

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

The Basics | R

Distinguish between different data types,
Import data from Excel spreadsheets and csv text files,
Add new variables to your data,
Modify variables in your data,
Remove variables from your data,
Save and export your data,
Work with RStudio projects,
Use |> pipes to chain functions together, and
Find errors in your code.

code

import sys
import numpy as np
import pandas as pd
import altair as alt

Do It Yourself | R Markdown Scripts

To create a new R Markdown script, $\color{brown}{\textit{select File > New File > R Markdown}}$.

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

1 + 1

  [1] 2

code

2 * 2

  [1] 4

code

1 + 2 * 3 - 5

  [1] 2

code

(1 + 2) * 3 - 5

  [1] 4
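
For readers following along in Python, the same order of operations applies: multiplication is carried out before addition and subtraction unless parentheses say otherwise. A minimal sketch, with illustrative variable names:

```python
# Multiplication first: 1 + 6 - 5
without_parentheses = 1 + 2 * 3 - 5

# Parentheses first: 3 * 3 - 5
with_parentheses = (1 + 2) * 3 - 5

print(without_parentheses)
print(with_parentheses)
```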

Variables and functions | Storing data

Without data, there is no data analytics.
A variable is a name used to store data, so that we can refer to a dataset when we write code.

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

x <- 4
x + 1

  [1] 5

code

x + x

  [1] 8

Variables and functions | Storing data

In some cases, you may want to switch the direction of the arrow, so that the variable name is on the right-hand side.
This is called right-assignment and works just fine too:

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

2 + 2 -> y
y + 3

  [1] 7

code

y * 5

  [1] 20

Do It Yourself | Storing data

$\color{black}{\textbf{Exercise 2.1}}$ Do the following using R:

Compute the sum 924 + 124 and assign the result to a variable named a.
Compute a * a

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

924 + 124 -> a
a

  [1] 1048

code

a * a

  [1] 1098304

Variables and functions | What’s in a name?

For the sake of readability, it is often preferable to give your variables more informative names.
There are a few different naming conventions that can be used to name your variables:
- snake_case: where words are separated by an underscore (_). Example: household_net_income.
- camelCase or CamelCase: where each new word starts with a capital letter. Example: householdNetIncome or HouseholdNetIncome.
- period.case: where each word is separated by a period (.). Example: household.net.income.

Example 1 | What’s in a name?

This lovely little code snippet can be used to compute your net income.

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

# Set income and taxes:
income <- 100 # replace 100 with your income
taxes <- 20 # replace 20 with how much taxes you pay

# Compute your net income
net_income <- income - taxes
# Voilà
net_income

  [1] 80

Exercise 2.2 | Do It Yourself

What happens if you use an invalid character in a variable name? Try e.g., the following:
- $\color{blue}{\text{net income <- income - taxes}}$
- $\color{blue}{\text{net-income <- income - taxes}}$
- $\color{blue}{\text{ca\$h <- income - taxes}}$
What happens if you put R code as a comment? For example,
- $\color{blue}{\text{income <- 100}}$
- $\color{blue}{\text{taxes <- 20}}$
- $\color{blue}{\text{net_income <- income - taxes}}$
- $\color{blue}{\text{# gross_income <- net_income + taxes}}$

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

Exercise 2.2 | Do It Yourself

What happens if you remove a line break and replace it by a semicolon ;? For example,
- $\color{blue}{\text{income <- 200; taxes <- 30}}$
What happens if you do two assignments on the same line? For example,
- $\color{blue}{\text{income2 <- taxes2 <- 100}}$

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

Variables and functions | Vectors and data frames

We can create a vector using the following code, where c stands for combine:

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)

Variables and functions | Vectors and data frames

In R, a table whose columns are vectors of equal length is called a data frame. We can combine the two vectors into a data frame as follows:

code

library(tidyverse)
library(tictoc)
library(reticulate)
use_miniconda("reptilia")

# create a data frame called bookstore using age and purchase vectors.

bookstore <- data.frame(age, purchase)
bookstore

    age purchase
  1  28       20
  2  48       59
  3  47        2
  4  71       12
  5  22       22
  6  80      160
  7  48       34
  8  30       34
  9  31       29
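
For comparison, a pandas counterpart of the bookstore data frame might look like the sketch below. Building the data frame from a dictionary of lists is one of several possible constructions, and the column names simply mirror the R vectors.

```python
import pandas as pd

age = [28, 48, 47, 71, 22, 80, 48, 30, 31]
purchase = [20, 59, 2, 12, 22, 160, 34, 34, 29]

# Each dictionary key becomes a column name, each list a column
bookstore = pd.DataFrame({"age": age, "purchase": purchase})

print(bookstore)
print(bookstore.shape)   # (number of rows, number of columns)
```

One visible difference from the R output is that pandas numbers the rows starting at 0 rather than 1.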