The mean (average) of a data set is found by adding all numbers in the data set and then dividing by the number of values in the set. It is most appropriate for continuous data whose distribution is roughly symmetric and free of extreme outliers.
The median is the middle number in a list of numbers sorted in ascending or descending order, and it can be more descriptive of the data set than the mean. The median is often preferred to the mean when outliers in the data would skew the average of the values.
The mode is the value that appears most often in a set of data values. The mode is often used for nominal or ordinal data.
Range is a measure of spread. The range is the difference between the largest and smallest observations in a sample.
The interquartile range (IQR) is another measure of spread. The IQR is the difference between the 75th percentile and the 25th percentile in a sample.
The standard deviation is a measure of the amount of variation or dispersion of a set of values around the mean; it is the square root of the variance.
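The definitions of mode, range, and IQR above can be sketched in R as follows. The vector `x` and the names `mode_value`, `value_range`, and `iqr_value` are hypothetical, chosen only for illustration. Note that base R's `mode()` reports an object's storage mode, not the statistical mode, so the most frequent value is found here with `table()`.

```r
# Hypothetical values, not taken from any data set in this report
x <- c(2, 3, 3, 5, 7, 8, 3, 5, 10)

# Statistical mode: the value(s) with the highest frequency (ties are all returned)
counts <- table(x)
mode_value <- as.numeric(names(counts)[counts == max(counts)])

# Range: largest observation minus smallest observation
value_range <- diff(range(x))

# Interquartile range: 75th percentile minus 25th percentile
iqr_value <- IQR(x)

c(mode = mode_value, range = value_range, IQR = iqr_value)
```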
Sensitivity is the ability of a test to correctly identify those with the disease (true positive rate).
Specificity is the ability of the test to correctly identify those without the disease (true negative rate).
Positive predictive value is the probability that subjects with a positive screening test truly have the disease.
Negative predictive value is the probability that subjects with a negative screening test truly don’t have the disease.
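As a rough illustration of the four screening measures above, the sketch below computes them from a 2x2 table of counts. The counts `tp`, `fn`, `fp`, and `tn` are invented for this example and do not come from any data set in this report.

```r
# Hypothetical screening results (illustrative counts only)
tp <- 90    # diseased, test positive (true positives)
fn <- 10    # diseased, test negative (false negatives)
fp <- 30    # not diseased, test positive (false positives)
tn <- 870   # not diseased, test negative (true negatives)

sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
ppv <- tp / (tp + fp)           # P(disease | positive test)
npv <- tn / (tn + fn)           # P(no disease | negative test)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 3)
```

Sensitivity and specificity are conditioned on true disease status, while the predictive values are conditioned on the test result, so the predictive values change with disease prevalence even when the test itself does not.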
Table 1 displays the forced expiratory volumes in 1 second for 13 adolescents suffering from asthma. Calculate the mean, median, variance, and standard deviation by hand.
Mean = (2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38)/13 = 38.35/13 = 2.95 L
Median: position = (n+1)/2 = (13+1)/2 = 7, so the median is the 7th value of the sorted data → 2.82 L
Sorted data: 2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05
Sample Variance = [(2.30-2.95)^2 + (2.15-2.95)^2 + (3.50-2.95)^2 + (2.60-2.95)^2 + (2.75-2.95)^2 + (2.82-2.95)^2 + (4.05-2.95)^2 + (2.25-2.95)^2 + (2.68-2.95)^2 + (3.00-2.95)^2 + (4.02-2.95)^2 + (2.85-2.95)^2 + (3.38-2.95)^2]/(13-1) = 4.6596/12 = 0.39 L^2
Standard Deviation = sqrt(0.39) = 0.62 L
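The hand calculations can be checked in R. The sketch below enters the 13 values from the calculation above into a vector (the name `fev` is an assumption made here), repeats each step explicitly, and compares the results with R's built-in functions.

```r
# The 13 forced expiratory volumes (litres) used in the hand calculation
fev <- c(2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
         2.25, 2.68, 3.00, 4.02, 2.85, 3.38)
n <- length(fev)

fev_mean <- sum(fev) / n                        # sum divided by count
fev_median <- sort(fev)[(n + 1) / 2]            # middle value; valid because n is odd
fev_var <- sum((fev - fev_mean)^2) / (n - 1)    # sample variance, n - 1 denominator
fev_sd <- sqrt(fev_var)                         # square root of the variance

# The step-by-step results should agree with the built-in functions
all.equal(c(fev_mean, fev_median, fev_var, fev_sd),
          c(mean(fev), median(fev), var(fev), sd(fev)))

round(c(mean = fev_mean, median = fev_median,
        variance = fev_var, sd = fev_sd), 2)
```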
lab2 = read.csv("lab2.csv")
mean(lab2$tar)
## [1] 11.50286
sd(lab2$tar)
## [1] 5.299416
median(lab2$tar)
## [1] 13
max(lab2$tar)-min(lab2$tar)
## [1] 18.3
IQR(lab2$tar)
## [1] 6.5
hist(lab2$tar, breaks = 6, col = "black",main = "Tar Measurements", xlab = "Values", ylab = "Frequency")
The data are skewed to the left: the mean (11.50) is lower than the median (13), which is consistent with a longer lower tail.
The median is most likely the best measure of central tendency here, because it is less affected than the mean by extreme values in the tail.
fac_sal <- read.csv("fac_sal.csv")
a) What does the distribution of academic year salaries look like? Is the distribution symmetric? Right-skewed? Left-skewed? Hint: Please describe the data graphically through a histogram.
hist(fac_sal$ay_salary, breaks = 28, col = "blue", main = "Academic Year Salaries",xlab = "$",ylab = "Frequency")
The distribution appears to be asymmetric and right-skewed.
b) How does the boxplot reflect what you noticed in the histogram? Are there faculty members who have particularly high salaries? Hint: Please describe the data graphically through a boxplot.
boxplot(fac_sal$ay_salary, col = "blue", main = "Academic Year Salaries", xlab = "faculty", ylab = "$")
The boxplot confirms what the histogram showed: the distribution is right-skewed, and a few faculty members have particularly high salaries.
Because the distribution is right-skewed with large outliers, I would expect the mean academic year salary to be larger than the median, and the median to better represent the central tendency. The summary output that follows confirms that the mean exceeds the median.
summary(fac_sal$ay_salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23800 36833 46728 47801 57573 103041
d) How variable are the academic salaries? Hint: Please give three measures for the variability: standard deviation, range, IQR, etc.
sd(fac_sal$ay_salary)
## [1] 13991.9
max(fac_sal$ay_salary) - min(fac_sal$ay_salary)
## [1] 79241
IQR(fac_sal$ay_salary)
## [1] 20740
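These values demonstrate that there is a large degree of variability amongst the academic salaries.
e) Are there any outliers in the academic salaries? Hint: Please give an explanation from the graphs and do the mathematical calculations.
Yes, there are outliers in the academic salaries.
Outliers are measurements that are greater than the 75th percentile plus 1.5 x IQR or less than the 25th percentile minus 1.5 x IQR. The lower fence is 36833 - 1.5 x 20740 = 5723, which is below the minimum salary of 23800, so only the upper fence needs to be checked.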
quantile(fac_sal$ay_salary, c(.75))
## 75%
## 57573
57573 + 1.5*20740
## [1] 88683
fac_sal$ay_salary[fac_sal$ay_salary>88683]
## [1] 89789 90082 91405 91489 96156 96744 103041
The seven values above are outliers: each exceeds the upper fence of 88683.
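The fence calculation above can be wrapped in a small helper so that both the lower and the upper fence are checked at once. This is only a sketch: `find_outliers` and the `salaries` vector are hypothetical names and values introduced for illustration, not part of the lab data.

```r
# Hypothetical helper: return the values lying outside the 1.5 x IQR fences
find_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # 25th and 75th percentiles
  step <- 1.5 * (q[2] - q[1])                     # 1.5 x IQR
  x[x < q[1] - step | x > q[2] + step]
}

# Hypothetical salaries in thousands of dollars
salaries <- c(30, 35, 40, 45, 50, 55, 60, 150)
find_outliers(salaries)
```

Base R's `boxplot.stats(x)$out` reports the points a boxplot would draw as outliers; it is based on hinges rather than on `quantile()`, so its results can differ slightly from this helper on small samples.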
f) What is the 20th percentile of the academic salaries? Hint: Use the quantile command.
quantile(fac_sal$ay_salary, (.20))
## 20%
## 34731
g) How many males are in the data set? How many females are in the data set? Hint: Please use the "table" command.
Females
length(fac_sal$female[fac_sal$female== "yes"])
## [1] 214
Males
length(fac_sal$female[fac_sal$female == "no"])
## [1] 511
table(fac_sal$female)
##
## no yes
## 511 214
h) What is the sex of the four faculty members with the highest academic salaries? Hint: Please sort the data using the sort command and check for that.
str(fac_sal$female[fac_sal$ay_salary == 103041])
## chr "no"
str(fac_sal$female[fac_sal$ay_salary == 96744])
## chr "no"
str(fac_sal$female[fac_sal$ay_salary == 96156])
## chr "no"
str(fac_sal$female[fac_sal$ay_salary == 91489])
## chr "no"
The four highest earners are all male, as shown by the "no" value of the female variable for each of them.
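The hint suggests sorting, which avoids typing each salary by hand. A minimal sketch of that approach is shown below on a made-up data frame: `toy` and its values are hypothetical, and only the column names `ay_salary` and `female` are taken from the lab data. The same two lines could be applied to `fac_sal`.

```r
# Hypothetical miniature data frame using the same column names as fac_sal
toy <- data.frame(
  ay_salary = c(52000, 88000, 47000, 91000, 60500, 73000),
  female = c("yes", "no", "no", "no", "yes", "no")
)

# Order rows from highest to lowest salary and keep the first four
top4 <- head(toy[order(toy$ay_salary, decreasing = TRUE), ], 4)
top4

# Count the sexes among the four highest earners
table(top4$female)
```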
Male = fac_sal$ay_salary[fac_sal$female == "no"]
Female = fac_sal$ay_salary[fac_sal$female == "yes"]
boxplot(Male, Female, col=c("blue", "pink"), main = "Comparison of male and female salaries", names=c ("Male","Female"), ylab = "$")
In this data set, male salaries appear to be higher than female salaries.
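A numerical summary by group can back up what the side-by-side boxplot suggests. The sketch below uses `tapply()` on a made-up data frame: `toy` and its values are hypothetical, and only the column names mirror `fac_sal`.

```r
# Hypothetical miniature data frame using the same column names as fac_sal
toy <- data.frame(
  ay_salary = c(52000, 88000, 47000, 91000, 60500, 73000),
  female = c("yes", "no", "no", "no", "yes", "no")
)

# Median salary within each level of the female variable
group_medians <- tapply(toy$ay_salary, toy$female, median)
group_medians

# Five-number summary plus mean within each group
tapply(toy$ay_salary, toy$female, summary)
```

Comparing medians rather than means keeps the comparison from being dominated by a few very high salaries.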