TASK: Your own personal log of commands
(Delete this opening from your final document.) Generate for yourself examples of commands you've seen, ones which perform various useful tasks. The first few pages provide examples done in R. On the final page are 10 exercises to which you should respond by generating examples with adequate explanations as appropriate, as if the audience is yourself several weeks or months from now. This is a quiz, but you can find examples adequate to your needs in the daily class notes.
The product of this work is the Quarto (.qmd) document itself, along with the .pdf obtained from rendering it. Both should be handed in, but you should also keep them for later reference, perhaps also expanding it as you learn more commands in the upcoming weeks. Upon rendering the .qmd, both files will be in your file space on the server. You'll need to download them to a local computer, and then upload them to MOM as your homework submission.
Basic operations on data frames
If you have the direct url to a data set in .csv format, it can be imported using the url in read.csv():
mlbSal = read.csv("https://www.lock5stat.com/datasets3e/BaseballSalaries2019.csv")
It is possible to read into R any .csv file using this command, not merely those delivered via a web address. If, for instance, you create a .csv file using a spreadsheet program like Microsoft Excel, you can read the data into R so long as the file can be accessed in your workspace.
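For example, a minimal sketch of reading a local file (the file name mydata.csv here is hypothetical; use the actual name of a .csv file in your workspace):
myData = read.csv("mydata.csv")   # hypothetical file name; reads from the working directory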
For data frames, either those imported as above, or ones obtained in other ways (through the loading of a package, for instance), you can learn their basic structure through commands like nrow() (tells how many rows/cases), names() (tells the names of columns/variables), or head() (to view the first few cases).
nrow(mlbSal)
[1] 877
names(mlbSal)
[1] "Name"   "Salary" "Team"   "POS"
head(mlbSal, n=3)   # specify the number of cases with 'n=_'
               Name Salary Team POS
1 Max Scherzer 42.143 WSH SP
2 Stephen Strasburg 36.429 WSH SP
3 Mike Trout 34.083 LAA CF
Building a data frame within R: data.frame()
Some problems are stated with their own small data sets, not easily found as .csv files. Exercise 1.20, p. 15 of the Lock5 text, is such a problem: two variables measured on 8 cases. If there is reason to make the data available in R, it might be easiest to build it with lines like these:
ex1.20data = data.frame(
sex = c("Male", "Male", "Male", "Male", "Male", rep("Female", 3)),
time = c(40, 87, 78, 106, 67, 70, 153, 81)
)
ex1.20data
     sex time
1 Male 40
2 Male 87
3 Male 78
4 Male 106
5 Male 67
6 Female 70
7 Female 153
8 Female 81
You can access the time values using the dFrameName$colName notation:
ex1.20data$time
[1]  40  87  78 106  67  70 153  81
Selecting cases from data frame: subset()
You can select out cases in a data frame based on specified criteria. It’s not the same as choosing only to see the values in a certain column. Rather, we want to see full case information, but perhaps only ones applying to “Female” rowers, or only those who made the Atlantic transit in between 70 and 100 days.
subset(ex1.20data, sex=="Female")   # Note the double-equal signs
     sex time
6 Female 70
7 Female 153
8 Female 81
subset(ex1.20data, time <= 100 & time >= 70)
     sex time
2 Male 87
3 Male 78
6 Female 70
8 Female 81
View the distribution from a named family: gf_dist()
We have encountered two types of random variables considered important enough to be given names: binomial and normal. There are many different instances within these families. The Norm(100, 15) distribution is the one typically used as a model for how different values of IQ appear in a population. It can be displayed using
gf_dist("norm", mean=100, sd=15)Here, using a pipe, I overlay a second normal distribution onto a first one:
gf_dist("norm", mean=0, sd=2, color="red") |>
gf_dist("norm", mean=1, sd=3, color="blue")To display a binomial distribution, Binom(10, 0.2) for instance:
gf_dist("binom", size=10, prob=0.2)When \(X \sim\) Binom(\(n, p\)), we know (from Chapter 11) that the mean and standard deviation of \(X\) are \(\mu_X = np\) and \(\sigma_X = \sqrt{np(1-p)}\). The specific r.v. \(X\sim\) Binom(100, 0.6) will have \(\mu_X=(100)(0.6)=60\) and \(\sigma_X=\sqrt{100(0.6)(0.4)} \doteq 4.90\). It is interesting to compare the distributions Binom(100, 0.6) and Norm(60, 4.9):
gf_dist("binom", size=100, prob=0.6) |>
gf_dist("norm", mean=60, sd=4.9, color="red")You take it from here. In this order, provide instructions to yourself for doing these things in R:
- Building a frequency table (univariate data) from raw data, as well as how to display it as a bar graph.
- Building a two-way table from raw data, much as Table 2.5 was built using the data in StudentSurvey, a data frame from the Lock5withR package.
- How to draw both an SRS and an iid sample from a collection of values.
- How to produce quantiles-to-order; that is, if it is desired to learn the 5th, 23rd, and 81st percentile in a sample of values, how to get these efficiently.
- How to generate a scatterplot for bivariate quantitative data.
- How to produce side-by-side boxplots, perhaps in the case of Sepal.Length, giving one boxplot per Species (iris data set).
- How to produce a histogram with bins that are of a specified width.
- R commands that can be used to produce values such as those found in the standard normal table at this link https://math.arizona.edu/~jwatkins/normal-table.pdf. For instance, the linked table indicates the cumulative probability up to \(Z=-0.92\) is \(0.1788\). On the other hand, if you wish to know the value of \(Z\) at which the cumulative probability is \(0.6368\), the table shows this occurs at \(Z=0.35\).
- How to use dbinom() and pbinom(), and how to understand their results.
- How to calculate the correlation coefficient for bivariate data.
How to build a frequency table using raw data
The first step is to enter the data into the computer as vectors, using the c() command. There are two ways this can be done, one for quantitative data and one for categorical data. Make sure to name your vector!
Categorical command
FavFood = c("Pizza","Bread","Pizza","Chicken","Chicken","Bread","Bread","Pizza")
Quantitative command
DiceRoll = c(1,4,3,6,5,3,2,4)
-How to display data-
The best way to display the data is to create a graph. This can be accomplished using the gf_ commands. First, however, it is essential to create a data frame. A data frame collects the raw vectors into columns, which makes it easier to produce graphs. To combine the two vectors we created, FavFood and DiceRoll, we use this command:
ClassData = data.frame(FavFood, DiceRoll)
Make sure you name the data frame, or else it can't be referenced later!
To see this data, and how the computer organized it, you can type the name of the data frame and hit enter.
ClassData
  FavFood DiceRoll
1 Pizza 1
2 Bread 4
3 Pizza 3
4 Chicken 6
5 Chicken 5
6 Bread 3
7 Bread 2
8 Pizza 4
Sometimes the data frame is too large, so you can use the head command to display the first six rows of data.
head(ClassData)
  FavFood DiceRoll
1 Pizza 1
2 Bread 4
3 Pizza 3
4 Chicken 6
5 Chicken 5
6 Bread 3
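To get the frequency table itself (the count of each category), one option not shown above is R's table() function; a minimal sketch using the data frame just built:
table(ClassData$FavFood)   # counts how many times each food appears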
-How to create a bar chart-
Bar charts are great at displaying categorical variables. The gf_bar command takes categorical data in a data frame and displays it as a bar chart.
gf_bar(~FavFood, data=ClassData)
Building a two-way table
If a refresher is needed on how to make a data frame, then review the section above.
When building a two-way table, it is essential to consider the variables being compared. Are the two variables being compared both categorical, both quantitative, or one quantitative and one categorical?
-Two categorical variables-
When working with two categorical variables, such as species of penguins and islands in the dataset penguins, use the bar chart command, but with a twist. If you add a "|" between the two variables, the computer makes a separate panel of the bar chart for each value of the second variable.
gf_bar(~species|island, data = penguins)
However, be careful about the order in which you place the variables. In this scenario, it created three graphs, one for each island, showing the number of each species on that island.
If, however, you switched the variables, you would get three new graphs, each representing a species and showing how many penguins of that species live on each island.
gf_bar(~island|species, data = penguins)
Essentially, be cautious about where the variables are located.
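The graphs above show the comparison, but the prompt asks for the two-way table of counts itself. One way to get it (a sketch, assuming the penguins data frame used above is available) is the table() function with both categorical variables:
table(penguins$species, penguins$island)   # rows are species, columns are islands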
-Two quantitative variables-
If the goal is to compare two quantitative variables, then we use a scatterplot.
This is a graph that allows us to visualize one variable on the x-axis and another on the y-axis.
Be careful when inputting the command, as the first variable entered is placed on the y-axis and the second on the x-axis.
gf_point(bill_len~bill_dep, data = penguins)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
-One quantitative and one categorical variable-
For one quantitative and one categorical variable, one good option is a histogram of the quantitative variable, faceted by the categorical one. This can be done with the gf_histogram command.
gf_histogram(~body_mass | species, data = penguins)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
In this instance, the categorical variable classifies which species each penguin belongs to, and the quantitative variable shows the distribution of body mass within each species.
:)
How to take an SRS and an iid sample from a collection of values
-SRS-
SRS stands for Simple Random Sample and is a method of sampling without replacement. This means that if you were to draw a one from the set of die values, you could not draw another one later in that same sample.
First, I am going to create a vector of the possible values of a die using the c() command.
die = c(1,2,3,4,5,6)
Then, I will use the sample command to take a simple random sample of size six.
sample(die, size=6)
[1] 3 1 2 5 4 6
Now I have a set of six random numbers from the list of numbers.
This is great if the desire is to have no repetition within the sample. However, if I were to try the same command and ask for a sample of size eight, the program would produce an error, because it is sampling without replacement and there are only six values available.
-IID-
IID stands for independent and identically distributed. This is a fancy way of saying that the program will sample with replacement, which means that numbers produced using this type of sample can repeat.
Going back to the die values, to draw an iid sample, use the resample command:
resample(die, size=6)
[1] 4 1 4 5 1 4
Now we have a sample drawn with replacement, so some numbers repeat, unlike in the SRS. This type of sample is appropriate when generating bootstrap distributions, or when a larger sample is needed than there are values available.
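The same idea works for drawing an SRS of cases (rows) from a data frame. A minimal sketch in base R, using the mlbSal data frame imported earlier (the sample size of 5 is arbitrary):
mlbSal[sample(nrow(mlbSal), size=5), ]   # 5 randomly chosen rows, without replacement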
How to produce quantiles-to-order
To produce quantiles-to-order from a normal distribution, use the qnorm command. This command gives us, for a normal distribution, the value that lies at a specific percentile. In this example, I am using a normal distribution with mean = 0 and standard deviation = 1, and I am finding the value below which 30% of the distribution lies.
qnorm(0.3,0,1)
[1] -0.5244005
This result signifies that 30% of the distribution lies below -0.5244005.
This command can be adjusted to fit any normal distribution by changing the mean and standard deviation in the command.
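The prompt also asks for percentiles of a sample of actual values, not just of a theoretical normal distribution. For that, one option is the quantile() function; a minimal sketch using the time variable from the ex1.20data frame built earlier:
quantile(ex1.20data$time, probs = c(0.05, 0.23, 0.81))   # the 5th, 23rd, and 81st percentiles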
How to produce side-by-side boxplots
I couldn't find the command in my notes, but the ggformula command for a boxplot is gf_boxplot().
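A minimal sketch for the iris example in the prompt (assuming the ggformula package is loaded); the formula is quantitative ~ categorical, which gives one box per group:
gf_boxplot(Sepal.Length ~ Species, data = iris)   # one boxplot of sepal length per species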
How to produce a histogram with bins that are of a specified width.
To modify the bins of a histogram, you alter the histogram command by adding one extra argument. The bins= argument sets the number of bins:
gf_histogram(~bill_len, data = penguins, bins = 50)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
So now, instead of the default number of bins, you have 50!
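The prompt asks specifically for bins of a specified width. For that, gf_histogram() also accepts a binwidth= argument, which sets the width of each bin directly; a sketch (the choice of width 1 is arbitrary):
gf_histogram(~bill_len, data = penguins, binwidth = 1)   # each bin is 1 unit wide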
How to find Z values
Z values are standardized, meaning they belong to a distribution with a mean of 0 and a standard deviation of 1.
The Z value represents how many standard deviations away a value is from the mean, and it can be found in a couple of different ways.
-From an X value-
If you have an x value from a normal distribution and you want to find the corresponding z value, you can use the xpnorm command.
For example, say you have a distribution with a mean of 4 and a standard deviation of 0.5, and want to find the Z value for the point X = 5.75. The command would look like this.
xpnorm(5.75,4,0.5)
If X ~ N(4, 0.5), then
P(X <= 5.75) = P(Z <= 3.5) = 0.9998
P(X > 5.75) = P(Z > 3.5) = 0.0002326
[1] 0.9997674
You can see on the graph that the Z-value is 3.5.
-From a proportion-
When starting from a proportion, we use the xqnorm command. Let's say we have the same scenario as last time, with a mean of 4 and a standard deviation of 0.5. Now, I want to find the z-value when p = 0.3, that is, the point below which 30% of the distribution lies.
xqnorm(0.3,4,0.5)
If X ~ N(4, 0.5), then
P(X <= 3.7378) = 0.3
P(X > 3.7378) = 0.7
[1] 3.7378
As you can see on the graph, the z-value is -0.52.
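The standard-normal-table values in the prompt can also be reproduced directly with pnorm() and qnorm(), whose defaults are mean 0 and sd 1. A sketch; the results should come out near the table values 0.1788 and 0.35 quoted in the prompt:
pnorm(-0.92)    # cumulative probability up to Z = -0.92
qnorm(0.6368)   # the Z value at which the cumulative probability is 0.6368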
How to use dbinom() and pbinom(), and how to understand their results.
-dbinom-
The dbinom command gives the probability of getting exactly X successes in N trials, where the probability of success on each trial is P. The first number in the command is the number of successes desired, the second is the number of trials, and the third is the probability of success on each trial.
dbinom(50,100,0.5)
[1] 0.07958924
The number 0.07958924 means that, in 100 trials where the probability of success on each trial is 0.5, there is about a 7.96% chance of getting exactly 50 successes.
-pbinom-
The pbinom command differs in that it calculates the probability of obtaining at most X successes in N trials with the probability of success being P. Let's use the same numbers as last time: 50 is the number of successes, 100 is the number of trials, and 0.5 is the probability of success.
pbinom(50,100,0.5)
[1] 0.5397946
This result, 0.5397946, means that there is about a 54% chance that, in 100 trials with the probability of success being 0.5, there will be at most 50 successes.
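One way to see how the two commands are related (a quick check, not from the class notes): pbinom() is the running total of dbinom() values, so summing dbinom() over 0 through 50 successes should reproduce the pbinom() result above:
sum(dbinom(0:50, 100, 0.5))   # should match pbinom(50, 100, 0.5)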
How to calculate the correlation coefficient for bivariate data.
The correlation coefficient comparing two sets of quantitative data describes the linear association between them. It ranges from -1 to 1. If there is a strong negative association between the variables, the correlation coefficient will be close to -1. If there is a strong positive association, it will be close to 1.
As an example, let's use test scores and hours studied.
TestScores = c(75,70,80,81,78,88,90,85,99,100)
HoursStuddied = c(1,1,2,4,3,5,6,5,7,9)
ClassStuff = data.frame(TestScores, HoursStuddied)
gf_point(HoursStuddied~TestScores, data=ClassStuff)
As you can see in the graph, there appears to be a positive association. And if we use the correlation coefficient command:
cor(HoursStuddied~TestScores, data=ClassStuff)
[1] 0.9611441
We obtain a result of 0.9611441, which is very close to 1, affirming that there is a strong positive association.
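A note for later reference (my assumption about where this syntax comes from): the formula version cor(y ~ x, data = ...) is supplied by the mosaic package. If only base R is available, an equivalent call passes the two columns directly:
cor(ClassStuff$HoursStuddied, ClassStuff$TestScores)   # same correlation, base R syntax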
I am done! Woo Woo!