Definitions

  1. Define mean, median, and mode as measures of central tendency. Under what conditions is the mean preferred? The median? The mode?

The mean (average) of a data set is found by adding all the values in the data set and dividing by the number of values. It is preferred for interval or ratio data whose distribution is roughly symmetric and free of extreme outliers.

The median is the middle value in a sorted list of numbers (ascending or descending); with an even number of values, it is the average of the two middle values. It is preferred over the mean when the distribution is skewed or contains outliers that would distort the mean.

The mode is the value that appears most often in a data set. It is preferred for nominal (categorical) data, where neither a mean nor a median can be computed, and is sometimes used for ordinal data.
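
As a minimal sketch, the three measures can be computed in R as shown below. The vector `x` is made-up illustration data, and `stat_mode` is a hypothetical helper written for this example, since base R has no built-in function for the statistical mode (its `mode()` function reports the storage type instead).

```r
# Made-up data with one large outlier (40)
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 40)

# Hypothetical helper: the most frequent value(s); all ties are returned
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

x_mean   <- mean(x)       # sum of the values divided by their count
x_median <- median(x)     # middle value of the sorted data
x_mode   <- stat_mode(x)  # most frequent value
```

In this toy example the outlier pulls the mean (about 8.1) well above the median (5), which is why the median is the safer summary when outliers are present.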

  2. Define the three commonly used measures of dispersion: range, interquartile range, and standard deviation.

Range is a measure of spread. The range is the difference between the largest and smallest observations in a sample.

The interquartile range (IQR) is another measure of spread. It is the difference between the 75th percentile and the 25th percentile of a sample.

The standard deviation measures how far the values in a set typically lie from their mean. It is the square root of the variance.
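
The three measures of dispersion can be sketched in R as follows. The vector `x` is made-up data used only for illustration. Note that R's `range()` returns the minimum and maximum rather than their difference, so the range is computed here with `max()` and `min()`.

```r
# Made-up sample
x <- c(4, 8, 15, 16, 23, 42)

x_range <- max(x) - min(x)  # largest minus smallest observation
x_iqr   <- IQR(x)           # 75th percentile minus 25th percentile
x_sd    <- sd(x)            # sample standard deviation (n - 1 denominator)

# The same standard deviation written out from its definition
x_sd_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
```

`IQR()` relies on R's default quantile method, so other software may report a slightly different interquartile range for small samples.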

  3. Define sensitivity, specificity, positive predictive value, and negative predictive value.

Sensitivity is the ability of a test to correctly identify those with the disease (true positive rate).

Specificity is the ability of the test to correctly identify those without the disease (true negative rate).

Positive predictive value is the probability that subjects with a positive screening test truly have the disease.

Negative predictive value is the probability that subjects with a negative screening test truly don’t have the disease.
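
The four definitions above correspond to simple ratios of the cells of a 2x2 screening table. The sketch below uses made-up counts (`tp`, `fn`, `fp`, `tn` are hypothetical names and values chosen for illustration only).

```r
# Made-up screening results
tp <- 90   # diseased, test positive (true positives)
fn <- 10   # diseased, test negative (false negatives)
fp <- 30   # not diseased, test positive (false positives)
tn <- 870  # not diseased, test negative (true negatives)

sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate
ppv         <- tp / (tp + fp)  # P(disease | positive test)
npv         <- tn / (tn + fn)  # P(no disease | negative test)
```

With these counts the sensitivity is 0.90 and the positive predictive value is 0.75. Sensitivity and specificity are conditioned on true disease status, whereas the predictive values are conditioned on the test result and therefore change with disease prevalence.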

Calculations

Table 1 displays the forced expiratory volumes in 1 second for 13 adolescents suffering from asthma. BY HAND, calculate the mean, median, variance, and standard deviation.

  1. Mean = (2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38)/13 = 38.35/13 = 2.95 L

  2. Median: position = (n+1)/2 = 14/2 = 7, so the 7th value in the sorted list is the median → 2.82 L

2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05

  3. Sample Variance = [(2.30-2.95)^2 + (2.15-2.95)^2 + (3.50-2.95)^2 + (2.60-2.95)^2 + (2.75-2.95)^2 + (2.82-2.95)^2 + (4.05-2.95)^2 + (2.25-2.95)^2 + (2.68-2.95)^2 + (3.00-2.95)^2 + (4.02-2.95)^2 + (2.85-2.95)^2 + (3.38-2.95)^2]/(13-1) = 4.6596/12 = 0.39 L^2

  4. Standard Deviation = sqrt(0.39) = 0.62 L
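
The hand calculations can be cross-checked in R. This is only a verification sketch: the values are the 13 measurements used above, and `fev` is a variable name chosen here for illustration.

```r
# The 13 forced expiratory volumes (litres) from the hand calculation
fev <- c(2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
         2.25, 2.68, 3.00, 4.02, 2.85, 3.38)

fev_mean   <- mean(fev)    # sum / n
fev_median <- median(fev)  # 7th of the 13 sorted values
fev_var    <- var(fev)     # sample variance, divides by n - 1
fev_sd     <- sd(fev)      # square root of the sample variance

# Round to two decimals for comparison with the hand-calculated values
round(c(mean = fev_mean, median = fev_median,
        variance = fev_var, sd = fev_sd), 2)
```

Rounded to two decimals, these agree with the hand-calculated values of 2.95 L, 2.82 L, 0.39 L^2, and 0.62 L.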

  1. The declared concentrations of tar for 35 brands of Canadian cigarettes are saved under the variable name tar in the data set lab2.csv (posted on the course website). Open this data in Stata, and use it to answer the following questions. Use the R commands from last week's Introduction to R to answer parts a-c.
lab2 = read.csv("lab2.csv")
  a) Find the mean and standard deviation of the concentrations of tar. Note: Don’t forget the units (check the R commands: summary, mean, and sd).
mean(lab2$tar)
## [1] 11.50286
sd(lab2$tar)
## [1] 5.299416
  b) Find the median, range, and interquartile range of tar.
median(lab2$tar)
## [1] 13
max(lab2$tar)-min(lab2$tar)
## [1] 18.3
IQR(lab2$tar)
## [1] 6.5
  c) Using R, produce a histogram of the tar measurements. Remember to use the frequency option. Describe the shape of the values.
hist(lab2$tar, breaks = 6, col = "black", main = "Tar Measurements", xlab = "Values", ylab = "Frequency")

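The distribution is skewed to the left: the mean (11.50) lies below the median (13), which is consistent with a longer tail toward low values.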

  d) Which number (mean or median) do you think provides the best measure of central tendency? Why?

The median is most likely the best measure of central tendency here. Because the distribution is skewed, the mean is pulled toward the tail, whereas the median is much less affected by unusually large or small values.

  1. For this question, the dataset we will use is fac_sal.csv (posted on the course website). This dataset contains information about faculty salaries from Bowling Green State University. The data are from Regression with Social Data, A. DeMaris, John Wiley and Sons, 2004. Variables of interest for this assignment include:
    • ay_salary = Academic salary (per year)
    • yrs_emp = Number of years of employment with the college
    • female = 1 if female, 0 if male
    • fac_rank = Indicates whether the faculty member is an assistant professor, an associate professor, a full professor, or an instructor
fac_sal <- read.csv("fac_sal.csv")

a) What does the distribution of academic year salaries look like? Is the distribution symmetric? Right-skewed? Left-skewed? Hint: Please describe the data graphically through a histogram.

hist(fac_sal$ay_salary, breaks = 28, col = "blue", main = "Academic Year Salaries", xlab = "$", ylab = "Frequency")

The distribution appears to be asymmetric and right-skewed.

b) How does the boxplot reflect what you noticed in the histogram? Are there faculty members who have particularly high salaries? Hint: Please describe the data graphically through a boxplot.

boxplot(fac_sal$ay_salary, col = "blue", main = "Academic Year Salaries", xlab = "faculty", ylab = "$")

The boxplot confirms what the histogram showed: the distribution is right-skewed, and a few faculty members have particularly high salaries.

c) Based on parts (a) and (b), would you expect the mean or median academic salary to be larger? Which measure of central tendency do you think is better in this case? Hint: Please confirm your answer with a descriptive summary, using the summary, mean, sd, and median commands.

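Because the distribution is right-skewed, I would expect the mean academic salary to be larger than the median, and the summary output confirms this (mean 47801 vs. median 46728). The median better represents the central tendency in this case because it is less affected by the large outliers.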

summary(fac_sal$ay_salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23800   36833   46728   47801   57573  103041

d) How variable are the academic salaries? Hint: Please give three measures of variability: standard deviation, range, IQR, etc.

sd(fac_sal$ay_salary)
## [1] 13991.9
max(fac_sal$ay_salary) - min(fac_sal$ay_salary)
## [1] 79241
IQR(fac_sal$ay_salary)
## [1] 20740

These values demonstrate that there is a large degree of variability amongst the academic salaries.

e) Are there any outliers in the academic salaries? Hint: Please give an explanation from the graphs and do the mathematical calculations.

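Yes, there are outliers in the academic salaries.

Outliers are measurements greater than the 75th percentile plus 1.5 × IQR, or less than the 25th percentile minus 1.5 × IQR. The lower fence here is 36833 - 1.5 × 20740 = 5723, which is below the minimum salary (23800), so only the upper fence needs checking.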

quantile(fac_sal$ay_salary, c(.75))
##   75% 
## 57573
57573 + 1.5*20740
## [1] 88683
fac_sal$ay_salary[fac_sal$ay_salary>88683]
## [1]  89789  90082  91405  91489  96156  96744 103041

These seven salaries exceed the upper fence of 88683 and are therefore outliers.
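
The 1.5 × IQR rule can be wrapped in a small function so that both fences are checked at once. This is a sketch under stated assumptions: `find_outliers` is a hypothetical helper written for this example, `toy` is made-up data, and the quartiles come from R's default `quantile()` method.

```r
# Hypothetical helper: flag values outside the 1.5 x IQR fences
find_outliers <- function(x) {
  q     <- quantile(x, c(0.25, 0.75), names = FALSE)  # 25th and 75th percentiles
  iqr   <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr  # lower fence
  upper <- q[2] + 1.5 * iqr  # upper fence
  list(lower = lower, upper = upper,
       outliers = x[x < lower | x > upper])
}

# Made-up data with one obvious high value
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
res <- find_outliers(toy)
```

For the toy data the fences are -3.5 and 14.5, so only 100 is flagged. Calling `find_outliers(fac_sal$ay_salary)` should apply the same rule to the salary data.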

f) What is the 20th percentile of the academic salaries? Hint: Use the quantile command.

quantile(fac_sal$ay_salary, (.20))
##   20% 
## 34731

g) How many males are in the data set? How many females are in the data set? Hint: Please use the "table" command.

Females

length(fac_sal$female[fac_sal$female== "yes"])
## [1] 214

Males

length(fac_sal$female[fac_sal$female == "no"])
## [1] 511
table(fac_sal$female)
## 
##  no yes 
## 511 214

h) What is the sex of the four faculty members with the highest academic salaries? Hint: Please sort the data using the sort command and check for that.

str(fac_sal$female[fac_sal$ay_salary == 103041])
##  chr "no"
str(fac_sal$female[fac_sal$ay_salary == 96744])
##  chr "no"
str(fac_sal$female[fac_sal$ay_salary == 96156])
##  chr "no"
str(fac_sal$female[fac_sal$ay_salary == 91489])
##  chr "no"

The four highest salary earners are all male, as shown by the “no” value for female.
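
An alternative that follows the sorting hint more directly is to order the rows by salary and read off the top four, rather than looking up each salary by value. The sketch below uses a made-up stand-in data frame, `toy_sal`, that borrows only the two column names from fac_sal; its values are invented for illustration.

```r
# Made-up stand-in for fac_sal with the same two column names
toy_sal <- data.frame(
  ay_salary = c(52000, 98000, 61000, 75000, 43000, 88000),
  female    = c("yes", "no", "no", "yes", "no", "no")
)

# Sort rows by salary, highest first, and keep the top four
top4 <- head(toy_sal[order(toy_sal$ay_salary, decreasing = TRUE), ], 4)

# Sex of the four highest earners in the toy data
top4$female
```

Replacing `toy_sal` with `fac_sal` should give the sex of the four highest-paid faculty members in a single step.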

  i) Compare the salaries of males and females by plotting boxplots by sex side-by-side. Does there appear to be a difference between males and females?
Male = fac_sal$ay_salary[fac_sal$female == "no"]
Female = fac_sal$ay_salary[fac_sal$female == "yes"]

boxplot(Male, Female, col = c("blue", "pink"), main = "Comparison of male and female salaries", names = c("Male", "Female"), ylab = "$")

Male salaries appear to be higher than female salaries in this data set.
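
The visual comparison can be backed up with group medians, and the side-by-side boxplot can also be drawn with R's formula interface instead of splitting the data into two vectors first. This is a sketch on a made-up stand-in data frame (`toy_sal`, with invented values); only the column names are taken from fac_sal.

```r
# Made-up stand-in for fac_sal with the same two column names
toy_sal <- data.frame(
  ay_salary = c(52000, 98000, 61000, 75000, 43000, 88000),
  female    = c("yes", "no", "no", "yes", "no", "no")
)

# Median salary within each group ("no" = male, "yes" = female)
med_by_sex <- tapply(toy_sal$ay_salary, toy_sal$female, median)

# Side-by-side boxplots via the formula interface; groups are ordered
# alphabetically ("no", then "yes"), so the labels are Male, Female
boxplot(ay_salary ~ female, data = toy_sal,
        names = c("Male", "Female"), col = c("blue", "pink"),
        main = "Comparison of male and female salaries", ylab = "$")
```

Running the same two calls on `fac_sal` would give the group medians to report alongside the plot.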