Q: A teacher wishes to know whether the males in his/her class have more conservative attitudes than the females. A questionnaire is distributed assessing attitudes and the males and the females are compared. Is this an example of descriptive or inferential statistics?
A: Inferential statistics are the mathematical procedures that convert information about a sample into intelligent guesses about a larger population. Here we have data on the entire class and wish to draw conclusions about only that class, so this is an example of descriptive statistics.
Q: If you are told only that you scored in the 80th percentile, do you know from that description exactly how it was calculated? Explain.
A: Unfortunately, no. There are at least two subtly different definitions of a percentile. Under the first, the Xth percentile is the lowest score that is greater than X% of the scores; under the second, it is the lowest score that is greater than or equal to X% of the scores. A third definition interpolates between neighboring ranked scores (a weighted average) and makes the 50th percentile coincide with the median. In short, you need to be explicit about which definition you are using whenever you report a “percentile”.
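To see how much the definition matters, here is a minimal sketch using base R's quantile(), whose type argument selects between several percentile definitions. The vector scores is a made-up illustration, not data from the exercises.

```r
# Hypothetical scores 1..10 (illustration only, not from the exercises)
scores <- 1:10

# Same 80th percentile, three of the definitions quantile() offers
p80_type1 <- quantile(scores, probs = 0.80, type = 1)  # inverse empirical CDF, no interpolation
p80_type6 <- quantile(scores, probs = 0.80, type = 6)  # interpolates at position (n + 1) * p
p80_type7 <- quantile(scores, probs = 0.80, type = 7)  # R's default, position (n - 1) * p + 1

print(c(type1 = unname(p80_type1),
        type6 = unname(p80_type6),
        type7 = unname(p80_type7)))
```

The three calls do not all return the same number, which is why "80th percentile" alone does not tell you how it was calculated.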
Q: Give an example of an independent and a dependent variable.
A: An independent variable is the one the experimenter manipulates or changes; the dependent variable is the outcome measured to see whether it responds. For example, suppose I wanted to test the speed of a runner wearing four different brands of running shoes. The brand of running shoe being tested would be the independent variable and the speed of the runner would be the dependent variable.
Q: Specify the level of measurement used for the items in Question 6. Question 6 shows Age, Country you were born in, Favorite color, and Time to respond to a question.
A: Age = Ratio Scale
Country you were born in = Nominal/Categorical Scale
Favorite color = Nominal/Categorical Scale
Time to respond to a question = Ratio Scale
Q: The formula for finding each student’s test grade (g) from his or her raw score (s) on a test is as follows: g = 16 + 3s
- Is this a linear transformation?
- If a student got a raw score of 20, what is his test grade?
A: Yes, this is a linear transformation: the function g(s) = 16 + 3s multiplies the raw score by a constant and adds a constant, so it plots as a straight line. For a raw score of 20, the test grade is g(20) = 16 + 3(20) = 76.
Q: Which of the frequency polygons has a large positive skew? Which has a large negative skew?
A: A frequency polygon traces the shape your data makes when it’s plotted. It is positively skewed if the high end of the x-axis (greater x’s) has low frequencies that trail off in a long right tail. Conversely, if the distribution has a long, thin tail on the left and the bulk of the scores piled up toward the right, we would consider it negatively skewed.
Q: Name some ways to graph quantitative variables and some ways to graph qualitative variables.
A: For qualitative data you can graph frequency (in the form of a bar chart) or relative frequency (in the form of a pie chart). For quantitative data you have more options: histogram, frequency polygon, box plot, stem-and-leaf display, line graph, scatter plot, etc.
Q: An experiment compared the ability of three groups of participants to remember briefly-presented chess positions. The data are shown below. The numbers represent the number of pieces correctly remembered from three chess positions. Create side-by-side box plots for these three groups. What can you say about the differences between these groups from the box plots?
knitr::opts_chunk$set(echo = TRUE)
require(tidyverse)
q2_file <- read.csv("Chapter two question three.csv", header = TRUE)
q2_file
## Non.Players Beginners Tournement.Players
## 1 22.1 32.50 40.1
## 2 22.3 37.10 45.6
## 3 26.2 39.10 51.2
## 4 29.6 40.50 56.4
## 5 31.7 45.50 58.1
## 6 33.5 51.30 71.1
## 7 38.9 52.60 74.9
## 8 39.7 55.70 75.9
## 9 43.2 55.98 80.3
## 10 43.2 57.70 85.3
summary(q2_file)
## Non.Players Beginners Tournement.Players
## Min. :22.10 Min. :32.50 Min. :40.10
## 1st Qu.:27.05 1st Qu.:39.45 1st Qu.:52.50
## Median :32.60 Median :48.40 Median :64.60
## Mean :33.04 Mean :46.80 Mean :63.89
## 3rd Qu.:39.50 3rd Qu.:54.92 3rd Qu.:75.65
## Max. :43.20 Max. :57.70 Max. :85.30
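#Reshape the three columns into long format (values, ind) for plotting
dat <- stack(as.data.frame(q2_file))
#Draw one box plot per group
ggplot(dat) +
  geom_boxplot(aes(x = ind, y = values)) +
  theme_minimal() +
  theme(axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.x = element_blank())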
A: In the box plots above, the median (dark line) of each plot increases with player ability, from 32.60 for non-players up to 64.60 for tournament players (the means, 33.04 and 63.89, follow the same pattern). Interestingly, the spread of scores also increases as you move upward through ability. E.g., the delta between the non-players’ 25th and 75th percentiles (the lower and upper limits of the box) is ~12, while the delta between the tournament players’ 25th and 75th percentiles is ~23.
Q: In a box plot, what percent of the scores are between the lower and upper hinges?
A: 50%. As noted above, the lower and upper hinges of the box represent the 25th and 75th percentiles of the data, so half of the scores fall between them.
Q: For the data from the 1977 Stat. and Biom. 200 class for eye color, construct a:
A pie chart
B horizontal bar graph
C vertical bar graph
D a frequency table with the relative frequency of each eye color
q7_file <- read.csv("Eye Color Data.csv", header = TRUE)
library(ggplot2)
library(ggpubr)
q7_plot <- data.frame(q7_file)
head(q7_plot)
## Eye.Color
## 1 Brown
## 2 Brown
## 3 Brown
## 4 Brown
## 5 Brown
## 6 Brown
#Create and plot vertical (stacked) bar chart; geom_bar counts the rows per eye color
vert.barq7 <- ggplot(q7_plot, aes(x = "", fill = Eye.Color)) +
  geom_bar(width = 1) +
  scale_fill_brewer(palette = "Blues") + theme_minimal() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
#Create and plot pie chart
pieq7 <- vert.barq7 +
  coord_polar("y", start = 0) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
#Create and plot horizontal bar chart
hor.barq7 <- vert.barq7 +
  coord_flip() +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
#Create and plot frequency (count) plot; geom_bar is the count geom for categorical data
histq7 <- ggplot(q7_plot, aes(Eye.Color)) +
  geom_bar(color = 'darkblue', fill = 'lightblue')
#Create an eyecolor table
eye.table <- table(q7_plot$Eye.Color)
#Arrange plots in a 2 x 2 view, label according to question
ggarrange(pieq7, hor.barq7, vert.barq7, histq7,
labels = c("A","B","C","D"),
ncol = 2, nrow = 2)
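The frequency table associated with the count plot in D is 10, 11, 1, 4, taken in order from left to right. Out of 26 students, those counts correspond to relative frequencies of about 0.385, 0.423, 0.038, and 0.154.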
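As a rough sketch of part D, the relative frequencies can be computed directly from the four counts quoted above. The counts are entered by hand here because the eye-color labels are not listed in the text; on the real data, prop.table(table(q7_plot$Eye.Color)) should give the same proportions with labels attached.

```r
# Counts as read off panel D, left to right (labels omitted, see note above)
counts <- c(10, 11, 1, 4)

# Relative frequency = count / total number of students
rel_freq <- counts / sum(counts)

# Frequency table with counts and rounded relative frequencies
freq_table <- data.frame(Count = counts, Relative = round(rel_freq, 3))
print(freq_table)
```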
Q: Which of the box plots on the graph has a large positive skew? Which has a large negative skew?
A: The box plots as published in the exercises show a large positive skew in box plot C, and a negative skew in box plot B.
Q: Make up a data set of 12 numbers with a positive skew. Use a statistical program to compute the skew. Is the mean larger than the median as it usually is for distributions with a positive skew? What is the value for skew?
A: Skewness of this data set = 2.163501, and the mean, equal to 1.5567, is larger than the median of 0.9000 (see below for calculations).
q3_file <- read.csv("Week.one.homework.skew.data.set.csv", header = TRUE)
q3_file
## AVG.Children.12.Neighborhoods
## 1 6.00
## 2 3.00
## 3 2.50
## 4 1.40
## 5 1.30
## 6 1.00
## 7 0.80
## 8 0.75
## 9 0.73
## 10 0.50
## 11 0.40
## 12 0.30
summary(q3_file)
## AVG.Children.12.Neighborhoods
## Min. :0.3000
## 1st Qu.:0.6725
## Median :0.9000
## Mean :1.5567
## 3rd Qu.:1.6750
## Max. :6.0000
require(Rfast)
q3_data <- as.matrix(q3_file)
q3_data
## AVG.Children.12.Neighborhoods
## [1,] 6.00
## [2,] 3.00
## [3,] 2.50
## [4,] 1.40
## [5,] 1.30
## [6,] 1.00
## [7,] 0.80
## [8,] 0.75
## [9,] 0.73
## [10,] 0.50
## [11,] 0.40
## [12,] 0.30
colskewness(q3_data, pvalue = TRUE)
## skewness p-value
## [1,] 2.163501 0.0006868332
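As a cross-check, the skewness can also be computed by hand in base R. This is a sketch that assumes the reported value is the sample-size-adjusted Fisher-Pearson coefficient; that assumption is mine, based on the fact that this formula reproduces the number printed above.

```r
# The 12 made-up neighborhood averages from the data set above
x <- c(6, 3, 2.5, 1.4, 1.3, 1, 0.8, 0.75, 0.73, 0.5, 0.4, 0.3)
n <- length(x)

# Second and third central moments (dividing by n)
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)

# Moment coefficient of skewness, then the small-sample adjustment
g1 <- m3 / m2^1.5
G1 <- g1 * sqrt(n * (n - 1)) / (n - 2)

print(c(mean = mean(x), median = median(x), skewness = G1))
```

A positive result with the mean above the median is what we expect for a distribution with a long right tail.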
Q: Make up three data sets with 5 numbers each that have:
(a) the same mean but different standard deviations.
(b) the same mean but different medians.
(c) the same median but different means.
chap3_q3_ab <- read.csv("Question 3a and 3b.csv", header = TRUE)
chap3_q3_c <- read.csv("Question 3c.csv", header = TRUE)
chap3_q3_ab
## Data1 Data2 Data3
## 1 1 5 1
## 2 1 5 5
## 3 18 60 54
## 4 30 80 70
## 5 200 100 120
chap3_q3_c
## Data1 Data2 Data3
## 1 10 5 18
## 2 20 20 20
## 3 100 40 50
## 4 20 20 20
## 5 10 5 18
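Taking our first data frame, which covers (a) and (b), we can see all three columns share a mean of 50 but have different standard deviations and different medians: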
mean of column 1 = 50
median of column 1 = 18
standard deviation of column 1 = 84.7437313
mean of column 2 = 50
median of column 2 = 60
standard deviation of column 2 = 43.445368
mean of column 3 = 50
median of column 3 = 54
standard deviation of column 3 = 49.3507852
Taking our second data frame, which covers (c), we can see all three columns share a median of 20 but have different means:
mean of column 1 = 32
median of column 1 = 20
standard deviation of column 1 = 38.340579
mean of column 2 = 18
median of column 2 = 20
standard deviation of column 2 = 14.4048603
mean of column 3 = 25.2
median of column 3 = 20
standard deviation of column 3 = 13.8996403
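The summary values listed above can be reproduced with a few lines of base R. This sketch re-enters the two data frames by hand rather than reading the CSV files; the helper name describe is my own.

```r
# Data for parts (a) and (b): same mean, different SDs and medians
ab <- data.frame(Data1 = c(1, 1, 18, 30, 200),
                 Data2 = c(5, 5, 60, 80, 100),
                 Data3 = c(1, 5, 54, 70, 120))

# Data for part (c): same median, different means
cc <- data.frame(Data1 = c(10, 20, 100, 20, 10),
                 Data2 = c(5, 20, 40, 20, 5),
                 Data3 = c(18, 20, 50, 20, 18))

# Mean, median, and sample standard deviation of every column
describe <- function(df) {
  sapply(df, function(col) c(mean = mean(col), median = median(col), sd = sd(col)))
}

stats_ab <- describe(ab)
stats_c <- describe(cc)
print(stats_ab)
print(stats_c)
```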
Q: A sample of 30 distance scores measured in yards has a mean of 10, a variance of 9, and a standard deviation of 3. (a) You want to convert all your distances from yards to feet, so you multiply each score in the sample by 3. What are the new mean, variance, and standard deviation? (b) You then decide that you only want to look at the distance past a certain point. Thus, after multiplying the original scores by 3, you decide to subtract 4 feet from each of the scores. Now what are the new mean, variance, and standard deviation?
A: I created a sample data set of 30 values that matched the criteria outlined above, multiplied each value by 3, and then subtracted 4; the results are in the table below. (a) Scaling everything by a factor of 3 multiplies the mean and standard deviation by 3 and the variance by 3² = 9, giving a mean of 30, a variance of 81, and a standard deviation of 9. (b) Subtracting 4 leaves us with the same variance (81) and standard deviation (9), but our mean shifts by exactly our delta, to 26!
chap3_q7 <- read.csv("Ch3_Q7.csv", header = TRUE)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'Ch3_Q7.csv'
as.data.frame(chap3_q7)
## X From_30_Base_Distance_Scores X3X X.3X..4
## 1 Mean 10 30 26
## 2 Variance 9 81 81
## 3 Standard Deviation 3 9 9
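The same check can be sketched without the CSV file by building a sample that has exactly the stated mean and standard deviation. The simulated values below are my own stand-in, not the contents of Ch3_Q7.csv, and describe_scores is a hypothetical helper name.

```r
set.seed(1)

# scale() standardizes to mean 0 and sd 1, so yards has mean 10 and sd 3
z <- as.vector(scale(rnorm(30)))
yards <- 10 + 3 * z

# (a) convert yards to feet, (b) then subtract 4 feet
feet <- 3 * yards
shifted <- feet - 4

describe_scores <- function(v) c(mean = mean(v), variance = var(v), sd = sd(v))
result <- rbind(yards = describe_scores(yards),
                feet = describe_scores(feet),
                shifted = describe_scores(shifted))
print(round(result, 4))
```

Multiplying by a constant rescales the spread, while adding or subtracting a constant only moves the center.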
Q: For the test scores in question #6, which measures of variability (range, standard deviation, variance) would be changed if the 22.1 data point had been erroneously recorded as 21.2?
q7_data <- c(15.2, 18.8, 19.3, 19.7, 20.2, 21.8, 22.1, 29.4)
q7_error_data <- c(15.2, 18.8, 19.3, 19.7, 20.2, 21.8, 21.2, 29.4)
Range of correct data = 15.2, 29.4
Standard deviation of correct data = 4.067796
Variance of correct data = 16.5469643
Range of incorrect data = 15.2, 29.4
Standard deviation of incorrect data = 4.0394483
Variance of incorrect data = 16.3171429
A: As you can see, when the erroneous score was recorded (shown in q7_error_data), both the standard deviation and the variance came out too low, because the bad data point moved closer to the mean. The range stayed true because the error did not affect the upper or lower bound of the data.
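For reference, the three measures can be computed side by side as sketched below. The helper spread is my own name, and the range is reported here as a single width (max minus min) rather than as the two endpoints.

```r
q7_data <- c(15.2, 18.8, 19.3, 19.7, 20.2, 21.8, 22.1, 29.4)
q7_error_data <- c(15.2, 18.8, 19.3, 19.7, 20.2, 21.8, 21.2, 29.4)

# Range width, sample standard deviation, and sample variance
spread <- function(v) c(range = diff(range(v)), sd = sd(v), variance = var(v))

comparison <- rbind(correct = spread(q7_data), error = spread(q7_error_data))
print(comparison)
```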
Q: For the numbers 1, 3, 4, 6, and 12: Find the value (v) for which Σ(X-v)² is minimized. Find the value (v) for which Σ|X-v| is minimized.
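A: For the first part we want the v that makes (1-v)² + (3-v)² + (4-v)² + (6-v)² + (12-v)² as small as possible. The sum can never reach zero, but the sum of squared deviations is smallest when v is the average of our values. Therefore v = (1 + 3 + 4 + 6 + 12)/5 = 5.2 is the best value.
For the second part, Σ|X-v| is minimized when v is the median of our values, so v = 4.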
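A numerical check is easy to sketch with base R's optimize(), which searches an interval for the minimum of a one-variable function. The function names sum_sq and sum_abs are mine.

```r
x <- c(1, 3, 4, 6, 12)

# The two quantities to minimize, as functions of v
sum_sq <- function(v) sum((x - v)^2)
sum_abs <- function(v) sum(abs(x - v))

# Search between the smallest and largest data value
best_sq <- optimize(sum_sq, interval = range(x))$minimum
best_abs <- optimize(sum_abs, interval = range(x))$minimum

print(c(best_sq = best_sq, mean = mean(x)))
print(c(best_abs = best_abs, median = median(x)))
```

The search is approximate, so the minimizers should land very close to the mean and the median rather than matching them to the last digit.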
Q: An experiment compared the ability of three groups of participants to remember briefly-presented chess positions. The data are shown below. The numbers represent the number of pieces correctly remembered from three chess positions. Compare the performance of each group. Consider spread as well as central tendency.
A: See Chapter 2, Question 3 above. Reiterating the data below but suppressing code chunks for clarity. As noted in the previous chapter, the memory of tournament players clearly exceeds that of non-players and beginners: they have the highest minimum, median, mean, and maximum. The other thing to note is spread: the non-players were the most consistent group, with the smallest IQR (~12) and range (21.1), while the tournament players had the largest IQR (~23) and range (45.2).
## Non.Players Beginners Tournement.Players
## 1 22.1 32.50 40.1
## 2 22.3 37.10 45.6
## 3 26.2 39.10 51.2
## 4 29.6 40.50 56.4
## 5 31.7 45.50 58.1
## 6 33.5 51.30 71.1
## 7 38.9 52.60 74.9
## 8 39.7 55.70 75.9
## 9 43.2 55.98 80.3
## 10 43.2 57.70 85.3
## Non.Players Beginners Tournement.Players
## Min. :22.10 Min. :32.50 Min. :40.10
## 1st Qu.:27.05 1st Qu.:39.45 1st Qu.:52.50
## Median :32.60 Median :48.40 Median :64.60
## Mean :33.04 Mean :46.80 Mean :63.89
## 3rd Qu.:39.50 3rd Qu.:54.92 3rd Qu.:75.65
## Max. :43.20 Max. :57.70 Max. :85.30
Q: True/False: The best way to describe a skewed distribution is to report the mean.
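A: False. The mean is pulled toward the long tail, so on its own it misrepresents a skewed distribution; you would at least also need the median to see which way the distribution leans in comparison to the mean. That said, the best way to see skewness is to plot the data and visualize it.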
Q: Compare the mean, median, trimean in terms of their sensitivity to extreme scores.
A: The median is the least sensitive, as one or two outliers will only move the median one or two positions left or right. The trimean would be the next most sensitive: it is computed as (Q1 + 2*Median + Q3)/4, so it depends on the quartiles, which move only sluggishly when you add in a few outliers (since they are determined by many other pieces of data), and it gives the median double weight, further watering down the effect. The most susceptible would be the mean.
Q: A set of numbers is transformed by taking the log base 10 of each number. The mean of the transformed data is 1.65. What is the geometric mean of the untransformed data?
A: Since we know the sum of all the log data divided by the number of values = 1.65, we can write ∑ log10(Xi) (from i = 1 to N) divided by N = 1.65.
Multiplying by N and using the rule that a sum of logs is the log of a product, we can convert that into ∏ Xi (from i = 1 to N) = 10^(N*1.65).
If I raise both sides to the power of 1/N, the right side of my equation turns into 10^1.65 and the left side becomes the geometric mean formula!
The answer is then 10^1.65, or 44.6683592.
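The identity behind this answer can be sketched in R: back-transforming the mean of the base-10 logs gives the same number as the Nth root of the product. The vector x is a made-up example, since the exercise does not give the untransformed data.

```r
# Hypothetical positive values (the exercise does not list the real ones)
x <- c(10, 20, 50, 100, 200)

# Geometric mean two ways: via the mean of the logs, and directly
gm_from_logs <- 10^mean(log10(x))
gm_direct <- prod(x)^(1 / length(x))

# The exercise's answer: back-transform the given log mean of 1.65
answer <- 10^1.65

print(c(gm_from_logs = gm_from_logs, gm_direct = gm_direct, answer = answer))
```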
Q: The histogram is in balance on the fulcrum. What are the mean, median, and mode of the distribution (approximate where necessary)?
A: In the histogram we see, the fulcrum is balanced at approximately 4.5. Since the mean is where the data is “balanced”, the mean is about 4.5. Visually I can pick out the value with the greatest frequency, the mode, which is equal to 1.0. In order to determine the median I need to estimate the approximate number of values in each column of the histogram.
1’s = 95
2’s = 75
3’s = 60
4’s = 55
5’s = 40
6’s = 35
7’s = 30
8’s = 25
9’s = 20
10’s = 18
11’s = 16
12’s = 14
13’s = 10
14’s = 7
15’s = 5
Since I have ~500 data points, the 250th would give me the median. Counting up the columns, the cumulative total reaches 230 at 3 and 285 at 4, so the 250th falls in the 4 column. The median is then 4.
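The counting argument can be sketched in R by expanding the estimated column heights into individual scores. The counts are the eyeballed estimates listed above, so any result inherits their roughness.

```r
values <- 1:15
# Estimated bar heights read off the histogram (approximate)
counts <- c(95, 75, 60, 55, 40, 35, 30, 25, 20, 18, 16, 14, 10, 7, 5)

# Expand the frequency table into one entry per observation
scores <- rep(values, times = counts)

est_median <- median(scores)
est_mode <- values[which.max(counts)]
est_mean <- mean(scores)

print(c(n = length(scores), median = est_median, mode = est_mode, mean = est_mean))
```

The mean from these rough counts can be compared against the visual fulcrum estimate as a sanity check on how well the bar heights were read.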
Q: Describe the relationship between variables A and C. Think of things these variables could represent in real life
A: The plot pictured shows values of C declining as values of A increase: an inverse relationship, or negative correlation. In real life that could be MPG (C) falling as engine volume (A) increases. Or it could be exercise stamina (C) declining with age (A). Anything with an inverse relationship.
Q: Make up a data set with 10 numbers that has a negative correlation.
c4_q3x <- c(0, 3, 5, 7, 15, 17, 20, 22, 25, 30)
c4_q3y <- c(8, 7, 6, 5, 5, 4, 3.8, 4, 2, 0.5)
df_c4_q3 <- data.frame(c4_q3x, c4_q3y)
ggplot(df_c4_q3, aes(x=c4_q3x, y=c4_q3y)) +
geom_point(size = 2, shape = 23) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
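To confirm the sign numerically rather than only from the plot, a short sketch with cor() on the same two vectors:

```r
c4_q3x <- c(0, 3, 5, 7, 15, 17, 20, 22, 25, 30)
c4_q3y <- c(8, 7, 6, 5, 5, 4, 3.8, 4, 2, 0.5)

# Pearson correlation; a negative value confirms the downward trend
r <- cor(c4_q3x, c4_q3y)
print(r)
```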
Q: Would you expect the correlation between High School GPA and College GPA to be higher when taken from your entire high school class or when taken from only the top 20 students? Why?
A: Correlation should be higher for the entire high school class. Selecting only the top 20 students restricts the range of High School GPA: those students all have nearly the same GPA, so there is little variation left for College GPA to track, and the correlation drops. Choosing the whole high school captures a much broader range of academic performance, which lets the underlying relationship show up as a higher correlation.
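The effect of restricting the range can be illustrated with a small simulation. This is only a sketch under assumptions of my own: standardized GPAs, a true correlation of 0.7, and classes of 300 students; none of these numbers come from the exercise.

```r
set.seed(42)

# Simulate one class and return the correlation for everyone and for the top 20
one_class <- function(n = 300, rho = 0.7) {
  hs <- rnorm(n)                                        # standardized high school GPA
  college <- rho * hs + sqrt(1 - rho^2) * rnorm(n)      # college GPA correlated with hs
  top20 <- order(hs, decreasing = TRUE)[1:20]           # the 20 highest hs GPAs
  c(whole = cor(hs, college), top20 = cor(hs[top20], college[top20]))
}

# Average over many simulated classes to smooth out sampling noise
sims <- replicate(500, one_class())
avg_r <- rowMeans(sims)
print(avg_r)
```

Averaged over many simulated classes, the correlation among the top 20 comes out lower than the correlation for the whole class, even though both were generated from the same underlying relationship.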
Q: For this same class, the relationship between the amount of time spent studying and the amount of time spent socializing per week was also examined. It was determined that the more hours they spent studying, the fewer hours they spent socializing. Is this a positive or negative association?
A: This is an example of a negative association. The inverse relationship would look the same as the one plotted above.
Q: Students took two parts of a test, each worth 50 points. Part A has a variance of 25, and Part B has a variance of 49. The correlation between the test scores is 0.6. (a) If the teacher adds the grades of the two parts together to form a final test grade, what would the variance of the final test grades be? (b) What would the variance of Part A - Part B be?
A: Using the variance sum law with standard deviations of 5 and 7, the covariance between the two parts is 0.6 × 5 × 7 = 21. (a) The variance of the final test grades is Var(A + B) = 25 + 49 + 2(21) = 116. (b) The variance of the difference is Var(A - B) = 25 + 49 - 2(21) = 32.
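The variance sum law, Var(A ± B) = Var(A) + Var(B) ± 2·r·SD(A)·SD(B), can be checked with a short sketch that builds the covariance matrix from the figures in the question. The variable names are my own.

```r
sd_a <- sqrt(25)   # Part A standard deviation
sd_b <- sqrt(49)   # Part B standard deviation
r <- 0.6           # correlation between the two parts

# Covariance and the 2 x 2 covariance matrix of (A, B)
cov_ab <- r * sd_a * sd_b
Sigma <- matrix(c(25, cov_ab, cov_ab, 49), nrow = 2)

# Variance of a weighted sum w1*A + w2*B is w' Sigma w
var_sum <- as.numeric(t(c(1, 1)) %*% Sigma %*% c(1, 1))
var_diff <- as.numeric(t(c(1, -1)) %*% Sigma %*% c(1, -1))

print(c(var_sum = var_sum, var_diff = var_diff))
```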
Q: True/False: It is possible for variables to have r=0 but still have a strong association.
A: TRUE! This is just like what the professor asked about in class when he plotted non-linear associations, like dots on a perfect circle. So yes, a linear correlation method can give you zero correlation when in fact there is a strong association in the data.
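The circle example is easy to sketch: points spaced evenly around a unit circle are perfectly tied together by x² + y² = 1, yet their Pearson correlation is essentially zero. The choice of 100 points is arbitrary.

```r
# 100 evenly spaced angles around the circle (drop the duplicate end point)
theta <- seq(0, 2 * pi, length.out = 101)[-101]
x <- cos(theta)
y <- sin(theta)

# Linear correlation is ~0 even though y is completely determined by x up to sign
r <- cor(x, y)
radius_check <- range(x^2 + y^2)

print(c(r = r, min_radius_sq = radius_check[1], max_radius_sq = radius_check[2]))
```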
Q: True/False: After polling a certain group of people, researchers found a 0.5 correlation between the number of car accidents per year and the driver’s age. This means that older people get in more accidents.
A: FALSE. A correlation of 0.5 is only a moderate association, so plenty of older drivers in the poll had few accidents. It also describes only the particular group that was polled, so it does not establish that older people in general get in more accidents.
Q: True/False: To examine bivariate data graphically, the best choice is two side by side histograms.
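A: FALSE. Bivariate data should be plotted across two axes, one per variable, as in a scatter plot, so that each point keeps its pair of values together. Side-by-side histograms only show the frequencies of each variable separately and hide the relationship between them.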
Q: Plot a histogram of the distribution of the Control-Out scores.
angry_moods <- read.csv("angry_moods.csv", header = TRUE)
angry_moods_q10 <- as.data.frame(angry_moods)
ggplot(angry_moods, aes(Control.Out)) +
geom_histogram(binwidth = 1, color = "darkblue", fill = "white") +
theme_minimal()
Q: What is the overall mean Control-Out score? What is the mean Control-Out score for the athletes? What is the mean Control-Out score for the non-athletes?
str(angry_moods_q10)
## 'data.frame': 78 obs. of 7 variables:
## $ Gender : int 2 2 2 2 1 1 1 2 2 2 ...
## $ Sports : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Anger.Out : int 18 14 13 17 16 16 12 13 16 12 ...
## $ Anger.In : int 13 17 14 24 17 22 12 16 16 16 ...
## $ Control.Out : int 23 25 28 23 26 25 31 22 22 29 ...
## $ Control.In : int 20 24 28 23 28 23 27 31 24 29 ...
## $ Anger_Expression: int 36 30 19 43 27 38 14 24 34 18 ...
angry_moods_q11_athlete <- angry_moods_q10[angry_moods_q10$Sports == 1, ]
angry_moods_q11_non_athlete <- angry_moods_q10[angry_moods_q10$Sports == 2, ]
The mean of control out for the whole data set is = 23.6923077
The mean of control out for the athletes is = 24.68
The mean of control out for the non-athletes = 23.2264151
Q: Plot parallel box plots of the Anger Expression Index by sports participation. Does it look like there are any outliers? Which group reported expressing more anger?
(needed to convert the numeric Sports column into character for split box plots)
angry_moods_q10$Sports <- as.character(angry_moods_q10$Sports)
ggplot(angry_moods_q10, aes(Sports,Anger_Expression)) +
geom_jitter () +
stat_boxplot(fill=NA) +
theme_minimal()
A: There are clearly more outliers AND more anger amongst non-athletes, labeled as “2” in the plot above.
Q: Plot parallel box plots of the Anger Expression Index by gender.
(Similar to the Sports column, I converted Gender into a character to represent categorical data)
angry_moods_q10$Gender <- as.character(angry_moods_q10$Gender)
ggplot(angry_moods_q10, aes(Gender,Anger_Expression)) +
geom_jitter () +
stat_boxplot(fill=NA) +
theme_minimal()
Q: What is the correlation between the Control-In and Control-Out scores? Is this correlation statistically significant at the 0.01 level?
ct1 <- cor.test(angry_moods$Control.In, angry_moods$Control.Out)
ct1
##
## Pearson's product-moment correlation
##
## data: angry_moods$Control.In and angry_moods$Control.Out
## t = 9.0261, df = 76, p-value = 1.19e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5914163 0.8118649
## sample estimates:
## cor
## 0.7192834
We can see the correlation between these two columns is ~0.72, fairly strong. Our p-value is 1.2e-13, far below 0.01, so the correlation is statistically significant at the 0.01 level!
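For readers who want to see where the p-value comes from, here is a sketch that recovers it from the t statistic and degrees of freedom printed above. The two numbers are copied from that output; nothing is re-read from the data file.

```r
# Values copied from the cor.test() output above
t_stat <- 9.0261
deg_free <- 76

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), deg_free)

# The correlation can be recovered from t and df: r = t / sqrt(t^2 + df)
r_from_t <- t_stat / sqrt(t_stat^2 + deg_free)

significant_at_01 <- p_value < 0.01
print(c(p_value = p_value, r = r_from_t))
print(significant_at_01)
```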
Q: Would you expect the correlation between the Anger-Out and Control-Out scores to be positive or negative? Compute this correlation.
ct2 <- cor.test(angry_moods$Anger.In, angry_moods$Anger.Out)
ct2
##
## Pearson's product-moment correlation
##
## data: angry_moods$Anger.In and angry_moods$Anger.Out
## t = 0.14026, df = 76, p-value = 0.8888
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.207186 0.237766
## sample estimates:
## cor
## 0.01608639
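Note that the call above correlates Anger.In with Anger.Out, not the Anger-Out and Control-Out pair the question asks about. For those two columns the correlation is ~0.02, which is extremely weak, and the p-value of 0.89 is far from significant. For Anger-Out versus Control-Out I would expect a negative correlation, since people who control their outward anger more should express less of it; computing it requires rerunning cor.test with angry_moods$Anger.Out and angry_moods$Control.Out.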