Define mean, median, and mode as measures of central tendency. Under what conditions is the mean preferred? The median? The mode?
The mean is the sum of all observations divided by the total number of observations. The mean is preferred when the distribution is continuous and symmetric (e.g. a Normal distribution).
The median is the 50th percentile. It represents the midpoint of a set of observations when they are arranged in increasing order. It is preferred when the distribution is skewed or contains outliers, because it is not sensitive to very large or very small values.
The mode is the value that occurs most frequently. The mode is more appropriate for nominal or ordinal data.
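As a small illustration of these definitions, the R sketch below uses a made-up vector (not taken from the lab data) that contains one extreme value. Base R has no built-in function for the statistical mode, so `stat_mode` is a hypothetical helper written only for this example.

```r
# Hypothetical data with one extreme value (not from the lab data sets)
x <- c(2, 3, 3, 4, 5, 6, 40)

mean(x)    # sum(x) / length(x); pulled upward by the value 40
median(x)  # middle value of the sorted data; not affected by the 40

# Hypothetical helper: return the most frequent value(s), as character
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

stat_mode(x)                                  # numeric data
stat_mode(c("red", "blue", "blue", "green"))  # nominal data
```

The helper returns every value tied for the highest count, which is one reasonable convention when a sample has more than one mode.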
Define the three commonly used measures of dispersion: range, interquartile range, and standard deviation.
Range is a measure of spread. The range is the difference between the largest and smallest observations in a sample.
The interquartile range (IQR) is another measure of spread. The IQR is the difference between the 75th percentile and the 25th percentile in a sample.
The standard deviation is the square root of the variance. It measures how far the observations typically lie from the mean, in the same units as the data.
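The three measures can be computed in base R as sketched below. The vector `x` is made up for illustration, and `IQR()` relies on R's default quantile method, which may differ slightly from a hand calculation on small samples.

```r
# Hypothetical sample (not from the lab data sets)
x <- c(4, 7, 8, 10, 12, 15, 21)

diff(range(x))  # range: largest minus smallest observation
IQR(x)          # 75th percentile minus 25th percentile
sd(x)           # sample standard deviation

# sd() is the square root of the sample variance, which divides by n - 1
manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
manual_sd
```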
Define sensitivity, specificity, positive predictive value, negative predictive value
Sensitivity is the ability of a test to correctly identify those with the disease (true positive rate).
Specificity is the ability of the test to correctly identify those without the disease (true negative rate).
Positive predictive value is the probability that subjects with a positive screening test truly have the disease.
Negative predictive value is the probability that subjects with a negative screening test truly don’t have the disease.
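The four definitions above can be written as ratios of the cells of a 2x2 table. The sketch below uses made-up counts (`tp`, `fn`, `fp`, `tn` are hypothetical names and values, not results from any study).

```r
# Hypothetical screening results (counts are made up for illustration)
tp <- 90   # diseased, test positive
fn <- 10   # diseased, test negative
fp <- 30   # not diseased, test positive
tn <- 170  # not diseased, test negative

sensitivity <- tp / (tp + fn)  # P(test positive | disease)
specificity <- tn / (tn + fp)  # P(test negative | no disease)
ppv         <- tp / (tp + fp)  # P(disease | test positive)
npv         <- tn / (tn + fn)  # P(no disease | test negative)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 3)
```

Sensitivity and specificity condition on true disease status, while the predictive values condition on the test result, so the predictive values change with the prevalence of disease in the group being screened.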
Table 1 displays the forced expiratory volumes in 1 second for 13 adolescents suffering from asthma. BY HAND, calculate the mean, the median, the variance, and the standard deviation.
a) Mean = (2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38)/13 = 38.35/13 = 2.95 L
b) Median: with n = 13, the median sits at position (n+1)/2 = 14/2 = 7 in the sorted data, so the median is the 7th value → 2.82 L
Sorted values: 2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05
c) Sample Variance = [(2.30-2.95)^2 + (2.15-2.95)^2 + (3.50-2.95)^2 + (2.60-2.95)^2 + (2.75-2.95)^2 + (2.82-2.95)^2 + (4.05-2.95)^2 + (2.25-2.95)^2 + (2.68-2.95)^2 + (3.00-2.95)^2 + (4.02-2.95)^2 + (2.85-2.95)^2 + (3.38-2.95)^2]/(13-1) = 4.6596/12 = 0.39 L^2
d) Standard Deviation = sqrt(0.39) = 0.62 L
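The hand calculations can be checked in R. This is only a verification sketch: the values are copied from part (a), and the vector name `fev` is an assumption.

```r
# FEV1 values (litres), copied from the mean calculation in part (a)
fev <- c(2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
         2.25, 2.68, 3.00, 4.02, 2.85, 3.38)

mean(fev)    # part (a)
median(fev)  # part (b)
var(fev)     # part (c); var() divides by n - 1
sd(fev)      # part (d); square root of var(fev)
```

Rounded to two decimals, these should agree with the hand results of 2.95, 2.82, 0.39, and 0.62.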
The declared concentrations of tar for 35 brands of Canadian cigarettes are saved under the variable name tar in the data set lab2.csv (it is posted on the course website). Open this data in Stata, and use it to answer the following questions. Use the Stata commands from last week's Introduction to Stata to answer parts a-c.
lab2 <- read.csv("lab2.csv")
mean(lab2$tar)
## [1] 11.50286
sd(lab2$tar)
## [1] 5.299416
median(lab2$tar)
## [1] 13
max(lab2$tar) - min(lab2$tar)
## [1] 18.3
IQR(lab2$tar)
## [1] 6.5
hist(lab2$tar, breaks = 6, col = "black",
main = "tar measurements",
xlab = "Values",
ylab = "Frequency")
The data appear to be negatively skewed (skewed to the left).
mean(lab2$tar)
## [1] 11.50286
median(lab2$tar)
## [1] 13
# Because the data are skewed and contain outliers, the median is the better measure of central tendency: it is the midpoint of the observations when they are arranged in increasing order, and it is not sensitive to very large values, very small values, or outliers.
For this question, the dataset we will use is fac_sal.csv (posted on the course website). This dataset contains information about faculty salaries from Bowling Green State University. The data are from Regression with Social Data, A. DeMaris, John Wiley and Sons, 2004. Variables of interest for this assignment include:
• ay_salary = Academic Salary (per Year)
• yrs_emp = Number of years of employment with the college
• female = 1 if female, 0 if male
• fac_rank = Indicates whether faculty member is: an assistant professor, an associate professor, a full professor, or an instructor
fac_sal <- read.csv("fac_sal.csv")
hist(fac_sal$ay_salary, breaks = 28, col = "green",
main = "Academic Year Salaries",
xlab = "$",
ylab = "Frequency")
# Based on the histogram of academic year salaries, the distribution appears to be right-skewed (positively skewed).
boxplot(fac_sal$ay_salary, col = "green",
main = "Academic Year Salaries", xlab = "faculty", ylab = "$")
# Yes, the boxplot reflects the same skew I noticed in the histogram. There are a few faculty members who have extremely high salaries.
# Due to the outliers, the median should be a better measure of central tendency in this instance. The mean is pulled upward by the few faculty members with extremely high salaries, so I expect the mean to be larger than the median (confirmed by the descriptive summary in R shown below).
summary(fac_sal$ay_salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23800 36833 46728 47801 57573 103041
sd(fac_sal$ay_salary)
## [1] 13991.9
max(fac_sal$ay_salary) - min(fac_sal$ay_salary)
## [1] 79241
IQR(fac_sal$ay_salary)
## [1] 20740
# With a standard deviation of about $13,992, an IQR of $20,740, and a range of $79,241, all three measures show substantial variability in academic salaries at this institution, especially considering that the lowest-paid faculty members' salaries are in the $20k-$30k range.
# Yes, there are outliers in the academic salaries. For example:
max(fac_sal$ay_salary)
## [1] 103041
# The highest paid faculty member received $103,041. This is an example of an outlier.
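# The box plot also revealed the outliers, as shown above. The IQR was previously calculated as $20,740. An outlier is an observation x such that either x > 75th percentile + 1.5×IQR or x < 25th percentile - 1.5×IQR. The lower fence here is 36833 - 1.5×20740 = 5723, which is below the minimum salary of $23,800, so only the upper fence needs checking: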
quantile(fac_sal$ay_salary, c(.75))
## 75%
## 57573
57573 + 1.5 * 20740
## [1] 88683
# Here are the outliers:
fac_sal$ay_salary[fac_sal$ay_salary > 88683]
## [1] 89789 90082 91405 91489 96156 96744 103041
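The fence calculation above can be wrapped in a small function. This is a sketch only: `iqr_outliers` is a hypothetical helper name, and the example vector is made up. Note that `boxplot()` draws its whiskers from hinges rather than from `quantile()`, so the two approaches may disagree slightly on some data sets.

```r
# Hypothetical helper: flag values outside the 1.5 x IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr  # lower fence
  upper <- q[2] + 1.5 * iqr  # upper fence
  list(lower = lower, upper = upper,
       outliers = x[x < lower | x > upper])
}

# Made-up values for illustration; with the lab data one would call
# iqr_outliers(fac_sal$ay_salary) instead
res <- iqr_outliers(c(1, 2, 3, 4, 5, 100))
res
```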
quantile(fac_sal$ay_salary, (.20))
## 20%
## 34731
# The number of females in this data set:
length(fac_sal$female[fac_sal$female == "yes"])
## [1] 214
# The number of males in this data set:
length(fac_sal$female[fac_sal$female == "no"])
## [1] 511
# Alternatively:
table(fac_sal$female)
##
## no yes
## 511 214
# Using the top 4 salaries (91489, 96156, 96744, 103041) from the outliers identified previously, I can check the sex of each of the 4 top-paid faculty members.
str(fac_sal$female[fac_sal$ay_salary == 103041])
## chr "no"
str(fac_sal$female[fac_sal$ay_salary == 96744])
## chr "no"
str(fac_sal$female[fac_sal$ay_salary == 96156])
## chr "no"
str(fac_sal$female[fac_sal$ay_salary == 91489])
## chr "no"
# As shown above, none of the 4 top-paid faculty members is female: all 4 faculty members with the highest academic salaries are male.
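An alternative to looking up each salary one at a time is to sort the rows and keep the first four. The sketch below uses a made-up data frame `toy` that only borrows the column names `ay_salary` and `female`; the values are hypothetical and are not the lab data.

```r
# Made-up data frame with the same column names as fac_sal (values are hypothetical)
toy <- data.frame(
  ay_salary = c(41000, 87000, 52000, 95000, 60000, 73000),
  female    = c("yes", "no", "yes", "no", "no", "yes")
)

# Sort rows by salary, highest first, and keep the top 4
top4 <- head(toy[order(toy$ay_salary, decreasing = TRUE), ], 4)
top4

# Tabulate sex among the top earners
table(top4$female)
```

The same pattern should carry over to the lab data by replacing `toy` with `fac_sal`.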
Male <- fac_sal$ay_salary[fac_sal$female == "no"]
Female <- fac_sal$ay_salary[fac_sal$female == "yes"]
boxplot(Male, Female, col=c("blue","pink"),
main = "Comparison of Male and Female Faculty Salaries", names=c("Male","Female"), ylab = "$")
Yes, there appears to be a difference between the salaries of male faculty and female faculty.
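A visual comparison can be backed up with group-wise summaries. The sketch below shows one way to do this with `tapply()`, again on a made-up data frame (`toy` and its values are hypothetical), so the numbers say nothing about the actual faculty salaries.

```r
# Made-up data frame with the same column names as fac_sal (values are hypothetical)
toy <- data.frame(
  ay_salary = c(41000, 87000, 52000, 95000, 60000, 73000),
  female    = c("yes", "no", "yes", "no", "no", "yes")
)

# Median and IQR of salary within each group
med_by_sex <- tapply(toy$ay_salary, toy$female, median)
iqr_by_sex <- tapply(toy$ay_salary, toy$female, IQR)

med_by_sex
iqr_by_sex
```

The median and IQR are used here rather than the mean and standard deviation because the salary distribution was found to be right-skewed with outliers.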