Name _______________________________________, Date ____________
Put your answers into your one document to rule them all. For things hand drawn, you’ll need to link to or upload a photo. There are about 10 questions, with 1 table to fill in. You can use calculators, spreadsheets, whatever. If you use a Google spreadsheet, you might be able to embed it into your one-doc-to-rule. Post to the #stats channel.
Don’t round at every step. Round only at the end. I’m assuming you will be using a spreadsheet to do your work. If not, and you are using a calculator, then be sure to keep 4 decimal places.
So, if you get 3.50682, keep 3.5068 in your calculations. At the very last step of writing down your final answer, round it to just 1 decimal place: 3.5.
The reason is that rounding errors accumulate: if you round often, the final answer can end up noticeably different from the correct one.
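As a small illustration of why early rounding matters, here is a sketch in R. The three intermediate values are made up for this example and are not part of the assignment data.

```r
# Hypothetical intermediate results (made up, not from the assignment data)
steps <- c(2.46, 1.46, 3.46)

# Rounding each value to 1 decimal first, then summing
rounded_early <- sum(round(steps, 1))

# Summing the unrounded values, then rounding once at the end
rounded_late <- round(sum(steps), 1)

print(c(early = rounded_early, late = rounded_late))
```

Rounding early gives 7.5, while rounding once at the end gives 7.4. With more steps, the gap can grow.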
The purpose of this assignment is to practice calculating measures of central tendency, variance, and standard deviation, as well as some simple graphing. You will work on this independently and then check your work as a group.
Two years ago I collected some data in a small research methods class and I am reproducing it here for you now.
Essentially, I was curious whether there was any relationship between the number of sources students used in their research proposals and the number of years of college they have.
Notice that I’ve given you the data twice. The first is sorted least to greatest by YEARS OF COLLEGE, and the 2nd presentation has it sorted by NUMBER OF SOURCES. This should help a bit with the calculations that follow.
| Years of college | # of sources used |
|---|---|
| 0.25 | 7 |
| 1.50 | 4 |
| 1.50 | 4 |
| 1.50 | 3 |
| 1.50 | NA |
| 2.00 | 3 |
| 2.00 | 6 |
| 2.25 | 2 |
| 2.50 | 2 |
| 3.00 | 4 |
| 3.00 | 4 |
| 3.00 | 2 |
| 3.50 | 3 |
| 5.00 | 14 |
| 5.00 | 1 |
| 5.25 | 14 |
| Years of college | # of sources used |
|---|---|
| 5.00 | 1 |
| 2.25 | 2 |
| 2.50 | 2 |
| 3.00 | 2 |
| 1.50 | 3 |
| 2.00 | 3 |
| 3.50 | 3 |
| 1.50 | 4 |
| 1.50 | 4 |
| 3.00 | 4 |
| 3.00 | 4 |
| 2.00 | 6 |
| 0.25 | 7 |
| 5.00 | 14 |
| 5.25 | 14 |
| 1.50 | NA |
Recall that to calculate the mean, you sum all of the scores and then divide by the number of scores. Symbolically, \(\bar{x} = \frac{\Sigma x_i}{n}\), where \(n\) is the number of scores. (By the way, the “\(i\)” indicates a particular record. So, \(x_6\) would be the 6th record.)
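If you are working in R rather than a spreadsheet, the formula can be sketched as follows. The scores here are made up for illustration, so this does not give away the answers for the class data. The second part shows one way to handle a missing value (NA), which matters because one record in the sources column is NA.

```r
# Made-up scores for illustration (not the assignment data)
scores <- c(5, 6, 2, 9, 12)

# The formula written out: sum of the scores divided by the number of scores
manual_mean <- sum(scores) / length(scores)

# R's built-in function gives the same result
builtin_mean <- mean(scores)

# A missing value (NA) makes mean() return NA unless it is told to skip it
with_missing <- c(5, 6, NA, 9)
mean_skipping_na <- mean(with_missing, na.rm = TRUE)

print(c(manual_mean, builtin_mean, mean_skipping_na))
```

Note that skipping an NA also means \(n\) is one smaller for that variable.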
Yrs college:_______________ Sources Used:_______________
To calculate the median, you must first sort the scores from least to greatest (as a reminder, I’ve done that for you above). If there is an odd number of scores, pick the middle score. If there is an even number of scores, first find the two middle scores, add them together, and divide by two. To find the position of the middle score in an ordered data set, compute \(\frac{n+1}{2}\). As an example, take \(\{5,6,2,9,12\}\): \(n=5\), so the middle position is \((5+1)/2 = 6/2 = 3\), and after sorting (\(2,5,6,9,12\)) the median is \(x_3 = 6\).
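The same steps can be sketched in R. The odd-sized set is the example from the text; the even-sized set is made up by adding one extra score (15) so that both cases are covered.

```r
# The example set from the text, plus a made-up even-sized set
odd_scores  <- c(5, 6, 2, 9, 12)
even_scores <- c(5, 6, 2, 9, 12, 15)

# Odd n: the middle position is (n + 1) / 2
sorted_odd <- sort(odd_scores)
middle_pos <- (length(sorted_odd) + 1) / 2
median_odd <- sorted_odd[middle_pos]

# Even n: average the two scores on either side of the middle
sorted_even <- sort(even_scores)
half <- length(sorted_even) / 2
median_even <- (sorted_even[half] + sorted_even[half + 1]) / 2

print(c(median_odd, median_even))
```

R’s built-in `median()` does the sorting and the odd/even handling for you, and it should agree with both results.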
Yrs college:_______________ Sources Used:_______________
To calculate the Mode, you select the most common score. With a small data set like this, you can just count them.
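Counting by hand is fine here, but if you want to check yourself in R, one possible approach is to tabulate the scores. This is only a sketch on a small example set, and the variable names are made up. Be aware that R’s own `mode()` function does something unrelated (it reports how a value is stored), so it will not give you the statistical mode.

```r
# A small example set of scores
scores <- c(8, 5, 3, 2, 2)

# Count how many times each score appears
counts <- table(scores)

# Keep the score(s) with the highest count; ties would return several values
most_common <- as.numeric(names(counts)[counts == max(counts)])

print(counts)
print(most_common)
```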
Yrs college:_______________ Sources Used:_______________
We haven’t really discussed how to calculate the way a data set varies, which is also called spread or dispersion. The most common names for it are the ‘Variance’ and the ‘Standard Deviation’. These are closely related: the standard deviation is the square root of the variance, which puts it back in the same units as the original scores. The concept tries to capture how far a score, or a set of scores, deviates from the mean of the group.
As a simple example, think about your commute to or from school before any pandemics or civil unrest. Living in Shoreline, my commute varies. Sometimes it’s quick, usually when the weather is clear and at a time of the week when there is little traffic. (Notice that I’m hypothesizing some factors that influence my commute time: weather and traffic.) Other times it’s a slog. Maybe the range of my commute is between 10 and 30 minutes.
Though knowing the range of scores is useful, we often like to think of an average amount of variation. If I know that on average my commute is 16 minutes, I can estimate how much padding I should add to that figure to make sure I get to school on time. If I have a lot of variation in my commutes to school, I need to add a large pad, say an additional 15 minutes on top of my average commute time. But if my range of commutes is between 10 and 12 minutes, my padding might be only 2 minutes.
To understand variance, the core concept is ‘difference’. Something (a score) is different from something else (the mean), and so the basic idea is that we subtract the mean from the score.
Here is an example from class:
data<-c(8,5,3,2,2)
print(as.data.frame(data))
## data
## 1 8
## 2 5
## 3 3
## 4 2
## 5 2
Its mean (\(\bar{x}\)) is the sum of those values divided by the number of values (\(\Sigma x/n\)), which in this case is 20/5, or 4.
## data mean dif.from.mean
## 1 8 4 4
## 2 5 4 1
## 3 3 4 -1
## 4 2 4 -2
## 5 2 4 -2
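A table like the one above could be built in R roughly as follows. This is a minimal sketch: the column names are copied from the output, and the object name `example_df` is only a label for this sketch.

```r
data <- c(8, 5, 3, 2, 2)

# One row per score: the score, the group mean repeated, and score minus mean
example_df <- data.frame(
  data          = data,
  mean          = mean(data),
  dif.from.mean = data - mean(data)
)
print(example_df)

# The positive and negative differences cancel out to zero here
print(sum(example_df$dif.from.mean))
```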
That last column might give you some intuition of how much variation there is. The original data is sorted, so you can see how the average acts as the anchor. If these were commute times, and you had to guess how much time it should take you to get to school, what is a good guess? I’m suggesting that the average is a good guess, but because of the variation, you may want to add 3 or 4 minutes.
Conceptually, we are looking to understand the average amount of variation. So your instinct may be to simply calculate the average of this last column: sum the scores and divide by \(n\). Your instinct would be fine, except that if you sum these numbers, you get zero, because the positive and negative differences cancel. So one technique is to take the absolute value (dropping the negative sign) of these 3rd column scores, which gives you some intuition of an average amount of variation around the mean. The absolute values add up to 10, and dividing by 5 gives a score of 2.
# Read the following code from the inside out. First comes each score's
# difference from the mean, then its absolute value (abs). Each absolute
# difference is divided by the number of scores, and finally the results
# are summed. (data is the vector c(8,5,3,2,2) defined above.)
sum(abs(data - mean(data)) / length(data))
## [1] 2
So, in this small data set, if we measure variation by using the “absolute difference method” we get 2. (This quantity is usually called the mean absolute deviation rather than a variance.)
It turns out that, for mathematical reasons, we don’t do this. I’m showing it to you because for some people it’s a more intuitive way to begin understanding what variance means. Instead, we do something else that can be a little upsetting: we square the differences. Doing so has many benefits for other statistical formulas that, again, we won’t go into. Here is the data again, but this time with the squared deviations in the 4th column.
## data mean dif.from.mean squared.Dif
## 1 8 4 4 16
## 2 5 4 1 1
## 3 3 4 -1 1
## 4 2 4 -2 4
## 5 2 4 -2 4
If we sum this 4th column, we get what is called the sum of squares, in this case 26. If we divide that sum by the size of the sample (\(n\)=5), we get what is called the Variance, or 26/5 = 5.2.
This variance is not intuitive. And you probably recall that to get here we had to square those scores in the 3rd column. To obtain the standard deviation, we need to take the square root of the variance, and so it becomes \(\sqrt{26/5} = \sqrt{5.2} = 2.28\) (Which isn’t too far off of the absolute difference method of 2).
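The arithmetic above can be checked with a few lines of R. This sketch uses the same five scores; the variable names are made up for the illustration.

```r
data <- c(8, 5, 3, 2, 2)

# Square each score's difference from the mean
squared_dif <- (data - mean(data))^2

# Sum of squares, then divide by n, then take the square root
sum_of_squares <- sum(squared_dif)
variance_n <- sum_of_squares / length(data)
sd_n <- sqrt(variance_n)

print(round(c(sum_of_squares, variance_n, sd_n), 2))
```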
This 2.28 is the standard deviation for our little sample.
Or is it?
This is where rules about statistics become a bit frustrating. It turns out that the formula I just gave you to calculate the variance and standard deviation underestimates the amount of variation for small sample sets. We know this because of many empirical and mathematical studies.
Here are the formulas for Variance and Standard Deviation that were just given:
Variance = \[ \frac{\Sigma(x_i-\bar{x})^2}{n} \]
Standard Deviation = \[ \sqrt{\frac{\Sigma(x_i-\bar{x})^2}{n}} \]
Because the formulas above tend to underestimate the amount of variation in small samples, statisticians apply a “correction” to the formula: subtract 1 from \(n\) in the denominator:
Sample Variance = \[ \frac{\Sigma(x_i-\bar{x})^2}{(n-1)} \]
Sample Standard Deviation = \[ \sqrt{\frac{\Sigma(x_i-\bar{x})^2}{(n-1)}} \]
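Applied to the five example scores from class, the corrected formulas can be sketched in R like this. The variable names are made up. R’s built-in `var()` and `sd()` functions also divide by \(n-1\), so they should match the hand calculation.

```r
data <- c(8, 5, 3, 2, 2)
n <- length(data)

# Sum of squared differences from the mean
sum_of_squares <- sum((data - mean(data))^2)

# Divide by n - 1 instead of n
sample_variance <- sum_of_squares / (n - 1)
sample_sd <- sqrt(sample_variance)

# Compare the hand calculation with R's built-in functions
print(c(by_hand = sample_variance, built_in = var(data)))
print(c(by_hand = sample_sd, built_in = sd(data)))
```

For this tiny sample the sample variance is 26/4 = 6.5, compared with 26/5 = 5.2 when dividing by \(n\).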
You may wonder when you should use which formula. Basically, if your sample size \(n\) is much larger than 30, you can use the first formula for standard deviation: \(\sqrt{\frac{\Sigma(x_i-\bar{x})^2}{n}}\).
But for samples smaller than 30, you should subtract 1 from \(n\): \(\sqrt{\frac{\Sigma(x_i-\bar{x})^2}{(n-1)}}\).
Why subtract 1? Math reasons. But in practice, when you deal with samples larger than about 30, the two formulas (dividing by \(n\) or by \(n-1\)) produce essentially the same answer.
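To see how quickly the two formulas converge, note that they differ only in the denominator, so the “divide by \(n\)” standard deviation equals the “divide by \(n-1\)” one multiplied by \(\sqrt{(n-1)/n}\). The sketch below prints that factor for a few sample sizes, which were picked arbitrarily for illustration.

```r
# A few sample sizes, chosen arbitrarily for illustration
n <- c(5, 10, 30, 100, 1000)

# Ratio of the divide-by-n SD to the divide-by-(n - 1) SD
ratio <- sqrt((n - 1) / n)

print(data.frame(n = n, ratio = round(ratio, 3)))
```

At \(n=5\) the factor is about 0.89, so the two answers differ by roughly 10%. By \(n=30\) it is about 0.98, and the difference shrinks further as \(n\) grows.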
So, now we can try this with our data of years in college.
Here is the data again:
## yrs.college sources.used
## 1 0.25 7
## 2 1.50 4
## 3 1.50 4
## 4 1.50 3
## 5 1.50 NA
## 6 2.00 3
## 7 2.00 6
## 8 2.25 2
## 9 2.50 2
## 10 3.00 4
## 11 3.00 4
## 12 3.00 2
## 13 3.50 3
## 14 5.00 14
## 15 5.00 1
## 16 5.25 14
Let’s just focus on years of college. First we need to know the mean \(\bar{x}\), which you’ve already calculated above (under header #3). Make a new column and repeat the mean on each line. Then, in a 3rd column, insert the difference between the score and the mean. In the 4th column, square that difference. For simplicity, don’t round any results until the last step. Only then report no more than 1 decimal.
| yrs.college | \(\bar{x}\) | \(x_i-\bar{x}\) | \((x_i-\bar{x})^2\) |
|---|---|---|---|
| 0.25 | | | |
| 1.5 | | | |
| 1.5 | | | |
| 1.5 | | | |
| 1.5 | | | |
| 2 | | | |
| 2 | | | |
| 2.25 | | | |
| 2.5 | | | |
| 3 | | | |
| 3 | | | |
| 3 | | | |
| 3.5 | | | |
| 5 | | | |
| 5 | | | |
| 5.25 | | | |
After you’ve filled out that last column, you can scan the spread of differences. Remember that we are conceptually dealing with differences. And since the data is sorted, the smallest differences will be somewhere around the middle, for us near the 8th or 9th score.
So, what is the Variance for years in college from your sample? (Hint: use \(n-1\).)
Yrs college:_______________
Now it’s just a matter of taking the square root of the variance above:
Yrs college:_______________
Scatter plots usually present two Non-categorical variables at a time. It’s best if they are both Continuous, which just means that the number could take any value within a range. So for example, continuous variables include height, weight, years in school. In our data, “Sources used” are counts, and it wouldn’t make sense to say that someone used 1.5 sources. But at least they are Interval data and so you can treat them as continuous for a simple graph.
As an example, I’ve randomly (that’s the ‘rnorm’ function) generated 10 records with 2 scores, x and y. Again, these are just made up.
set.seed(1)  # fix the random seed so the same numbers come out every time
x <- round(rnorm(n = 10, mean = 22, sd = 4))
y <- round(x + rnorm(n = 10, mean = 1, sd = 2))
print(data.frame(x, y))
## x y
## 1 19 23
## 2 23 25
## 3 19 19
## 4 28 25
## 5 23 26
## 6 19 20
## 7 24 25
## 8 25 28
## 9 24 27
## 10 21 23
You can see in the graph below that record #2, with its two values, 23 and 25, makes a single point on the graph (marked by the dotted lines).
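If you want to draw this kind of graph yourself in R, a minimal sketch using base graphics might look like the following. The title and the use of `abline()` for the dotted guide lines are choices made for this sketch, and may not match how the original graph was drawn.

```r
# Regenerate the made-up example data
set.seed(1)
x <- round(rnorm(n = 10, mean = 22, sd = 4))
y <- round(x + rnorm(n = 10, mean = 1, sd = 2))

# Scatter plot: one point per record
plot(x, y, xlab = "x", ylab = "y", main = "Made-up example data")

# Dotted guide lines through record #2 (x = 23, y = 25)
abline(v = x[2], h = y[2], lty = "dotted")
```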
Next, you will plot the student data of years in college and sources used.
Here is the data, again:
## yrs.college sources.used
## 1 0.25 7
## 2 1.50 4
## 3 1.50 4
## 4 1.50 3
## 5 1.50 NA
## 6 2.00 3
## 7 2.00 6
## 8 2.25 2
## 9 2.50 2
## 10 3.00 4
## 11 3.00 4
## 12 3.00 2
## 13 3.50 3
## 14 5.00 14
## 15 5.00 1
## 16 5.25 14
With the data above, first choose which variable should go on the X axis and which on the Y axis. The X axis is horizontal and is usually used for the variable you ‘Predict from’, while the Y axis, the vertical one, is what you are trying to predict.
Then, pick 6 records and plot them (so you don’t have to waste your time doing them all).