OCS 283 Problem Set #1

DEFINITIONS

Define mean, median, and mode as measures of central tendency. Under what conditions is the mean preferred? The median? The mode?
1. The mean is the sum of all observations divided by the number of observations. The mean is preferred when the data are continuous and roughly symmetric (e.g., a Normal distribution).
2. The median is the 50th percentile: the midpoint of a set of observations arranged in increasing order. It is not sensitive to very large values, very small values, or outliers, so it is preferred when the distribution is skewed or contains outliers.
3. The mode is the value that occurs most frequently. The mode is more appropriate for nominal or ordinal data.
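As a quick illustration of the three definitions above, here is a minimal sketch in R. The vector `x` is a made-up sample (not part of the problem set) with one extreme value, chosen to show how the mean is pulled by an outlier while the median and mode are not. Base R has no built-in function for the statistical mode, so the sketch tabulates the values instead.

```r
# Hypothetical sample with one extreme value (not from the problem set)
x <- c(2, 3, 3, 4, 5, 6, 40)

sample_mean   <- mean(x)    # sum(x) / length(x)
sample_median <- median(x)  # middle value of the sorted data

# Tabulate the values and keep the most frequent one(s) as the mode
counts <- table(x)
sample_mode <- as.numeric(names(counts)[counts == max(counts)])

sample_mean    # 9, pulled upward by the value 40
sample_median  # 4
sample_mode    # 3
```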
Define the three commonly used measures of dispersion: range, interquartile range, and standard deviation
1. Range is a measure of spread. The range is the difference between the largest and smallest observations in a sample.
2. The interquartile range (IQR) is another measure of spread. The IQR is the difference between the 75th percentile and the 25th percentile in a sample.
3. The standard deviation is the square root of the variance. It measures how far observations typically fall from the mean and is expressed in the same units as the data.
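The three measures of dispersion can be sketched in R as follows. The vector `x` is a made-up sample used only for illustration. Note that `IQR()` relies on R's default quantile method, which can differ slightly from percentiles computed by hand.

```r
# Hypothetical sample (not from the problem set)
x <- c(2, 4, 4, 5, 7, 9, 11)

sample_range <- max(x) - min(x)  # largest minus smallest observation
sample_iqr   <- IQR(x)           # 75th percentile minus 25th percentile

# Standard deviation written out from its definition, then via sd()
sd_by_definition <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
sample_sd <- sd(x)

sample_range  # 9
sample_iqr    # 4
sample_sd     # about 3.16, the same value as sd_by_definition
```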
Define sensitivity, specificity, positive predictive value, negative predictive value
1. Sensitivity is the ability of a test to correctly identify those with the disease (true positive rate).
2. Specificity is the ability of the test to correctly identify those without the disease (true negative rate).
3. Positive predictive value is the probability that subjects with a positive screening test truly have the disease.
4. Negative predictive value is the probability that subjects with a negative screening test truly don’t have the disease.
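The four screening-test measures above can all be read off a 2x2 table of test result against disease status. A minimal sketch in R, assuming illustrative counts (`tp`, `fn`, `fp`, `tn` are hypothetical and not taken from the problem set):

```r
# Hypothetical screening results (illustrative counts only)
tp <- 90   # diseased, test positive
fn <- 10   # diseased, test negative
fp <- 30   # not diseased, test positive
tn <- 870  # not diseased, test negative

sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate
ppv <- tp / (tp + fp)          # P(disease | positive test)
npv <- tn / (tn + fn)          # P(no disease | negative test)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 3)
```

Sensitivity and specificity condition on true disease status, while the predictive values condition on the test result, so the predictive values also depend on how common the disease is in the tested population.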

CALCULATIONS

Table 1 displays the forced expiratory volumes in 1 second for 13 adolescents suffering from asthma. BY HAND, calculate the mean, median, variance, and standard deviation.

a) Mean = (2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38)/13 = 38.35/13 = 2.95 L

b) Median: with n = 13 observations, the median sits at position (n+1)/2 = 14/2 = 7 in the sorted data, so the median is the 7th value → 2.82 L

Sorted data: 2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05

c) Sample Variance = [(2.30-2.95)^2 + (2.15-2.95)^2 + (3.50-2.95)^2 + (2.60-2.95)^2 + (2.75-2.95)^2 + (2.82-2.95)^2 + (4.05-2.95)^2 + (2.25-2.95)^2 + (2.68-2.95)^2 + (3.00-2.95)^2 + (4.02-2.95)^2 + (2.85-2.95)^2 + (3.38-2.95)^2]/(13-1) = 4.6596/12 = 0.39 L^2

d) Standard Deviation = sqrt(0.39) = 0.62 L
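The hand calculations above can be cross-checked in R. This is only a verification sketch. The vector name `fev` is my own, and the values are the 13 measurements listed in part (a).

```r
# The 13 forced expiratory volumes (in litres) from part (a)
fev <- c(2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
         2.25, 2.68, 3.00, 4.02, 2.85, 3.38)

fev_mean   <- mean(fev)
fev_median <- median(fev)
fev_var    <- var(fev)  # sample variance, divides by n - 1
fev_sd     <- sd(fev)   # square root of the sample variance

round(c(mean = fev_mean, median = fev_median,
        variance = fev_var, sd = fev_sd), 2)
# mean 2.95, median 2.82, variance 0.39, sd 0.62
```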

R

The declared concentrations of tar for 35 brands of Canadian cigarettes are saved under the variable name tar in the data set lab2.csv (it is posted on the course website). Open this data in Stata, and use it to answer the following questions. Use the Stata commands from last week's Introduction to Stata to answer parts a-c.

lab2 <- read.csv("lab2.csv")

Find the mean and standard deviation of the concentrations of tar. Note: Don’t forget the units (check the label, Stata command: describe)

mean(lab2$tar)

## [1] 11.50286

sd(lab2$tar)

## [1] 5.299416

Find the median, range, and interquartile range of tar.

median(lab2$tar)

## [1] 13

max(lab2$tar) - min(lab2$tar)

## [1] 18.3

IQR(lab2$tar)

## [1] 6.5

Using R, produce a histogram of the tar measurements. Remember to use the frequency option. Describe the shape of the values.

hist(lab2$tar, breaks = 6, col = "black", 
     main = "tar measurements",
     xlab = "Values",
     ylab = "Frequency")

The data appear to be negatively skewed (skewed to the left), which is consistent with the mean (11.50) being smaller than the median (13).

Which number (mean or median) do you think provides the best measure of central tendency? Why?

mean(lab2$tar)

## [1] 11.50286

median(lab2$tar)

## [1] 13

# Because the dataset is skewed with outliers, the median is the better measure of central tendency: it represents the midpoint of the observations when they are arranged in increasing order and is not sensitive to very large values, very small values, or outliers.

For this question, the dataset we will use is fac_sal.csv (posted on the course website). This dataset contains information about faculty salaries from Bowling Green State University. The data are from Regression with Social Data, A. DeMaris, John Wiley and Sons, 2004. Variables of interest for this assignment include:

• ay_salary = Academic Salary (per Year)
• yrs_emp = Number of years of employment with the college
• female = 1 if female, 0 if male
• fac_rank = Indicates whether faculty member is: an assistant professor, an associate professor, a full professor, or an instructor

fac_sal <- read.csv("fac_sal.csv")

What does the distribution of academic year salaries look like? Is the distribution symmetric? Right-skewed? Left-skewed? Hint: Please describe the data graphically through a Histogram.

hist(fac_sal$ay_salary, breaks = 28, col = "green", 
     main = "Academic Year Salaries",
     xlab = "$",
     ylab = "Frequency")

#Based on the histogram of academic year salaries, it appears that the distribution is right-skewed (aka positive-skewed).

How does the boxplot reflect what you noticed in the histogram? Are there faculty members who have particularly high salaries? Hint: Please describe the data graphically through a boxplot.

# The salary vector is passed directly, so no data argument is needed
boxplot(fac_sal$ay_salary, col = "green",
     main = "Academic Year Salaries", xlab = "faculty", ylab = "$")

#Yes, the boxplot reflects the same skew I noticed in the histogram. There are a few faculty members who have extremely high salaries.

Based on parts (1) and (2), would you expect the mean or median academic salary to be larger? Which measure of central tendency do you think is better in this case? Hint: Please confirm your answer by descriptive summary, use the command summarize Variable name, detail; and by explaining the graphs, skewness and outliers effects.

# Because of the right skew and the high outliers, I expect the mean to be larger than the median: the mean is pulled upward by the few faculty who have extremely high salaries. The median is therefore the better measure of central tendency in this instance. The descriptive summary in R below confirms this (mean 47801 > median 46728).

summary(fac_sal$ay_salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23800   36833   46728   47801   57573  103041

How variable are the academic salaries? Hint: Please give three measures of variability: standard deviation, range, IQR, etc.

sd(fac_sal$ay_salary)

## [1] 13991.9

max(fac_sal$ay_salary) - min(fac_sal$ay_salary)

## [1] 79241

IQR(fac_sal$ay_salary)

## [1] 20740

# The standard deviation ($13,992), range ($79,241), and IQR ($20,740) are all large relative to the lowest-paid faculty members' salaries, which are in the ballpark of $20k-30k, so there is indeed tremendous variability in academic salaries at this institution.

Are there any outliers in the academic salaries? Hint: Please give an explanation from the graphs and do the mathematical calculations.

# Yes, there are outliers in the academic salaries. For example:

max(fac_sal$ay_salary)

## [1] 103041

# The highest paid faculty member received $103,041. This is an example of an outlier.

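# The boxplot also revealed the outliers, as shown above. The IQR was previously calculated as $20,740. An outlying value (or an outlier) is a measurement/observation x such that either x > 75th percentile + 1.5×IQR or x < 25th percentile - 1.5×IQR. The lower cutoff is 36833 - 1.5×20740 = 5723, which is below the minimum salary of 23800, so only the upper cutoff can flag outliers here.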

quantile(fac_sal$ay_salary, c(.75))

##   75% 
## 57573

57573 + 1.5 * 20740

## [1] 88683

# Here are the outliers:

fac_sal$ay_salary[fac_sal$ay_salary > 88683]

## [1]  89789  90082  91405  91489  96156  96744 103041
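The 1.5×IQR rule used above can be wrapped in a small helper that checks both cutoffs at once. This is a sketch only: `iqr_outliers` and the `salaries` vector are hypothetical names and values of my own, and the quartiles come from R's default `quantile()` method, which may differ slightly from quartiles computed by hand or from the hinges that `boxplot()` draws.

```r
# Return the values of x that fall outside the 1.5 x IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence_width <- 1.5 * (q[2] - q[1])
  lower <- q[1] - fence_width
  upper <- q[2] + fence_width
  x[x < lower | x > upper]
}

# Hypothetical salaries in thousands of dollars (not from fac_sal.csv)
salaries <- c(30, 35, 38, 40, 42, 45, 48, 50, 55, 120)
iqr_outliers(salaries)  # 120
```

With the real data, the same helper would be called as `iqr_outliers(fac_sal$ay_salary)`.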

What is the 20th percentile of the academic salaries?

quantile(fac_sal$ay_salary, 0.20)

##   20% 
## 34731

How many males are in the data set? How many females are in the data set? Hint: Please use the tabulate command.

# The number of females in this data set:
length(fac_sal$female[fac_sal$female == "yes"])

## [1] 214

# The number of males in this data set:
length(fac_sal$female[fac_sal$female == "no"])

## [1] 511

# Alternatively:
table(fac_sal$female)

## 
##  no yes 
## 511 214

What is the sex of the four faculty members with the highest academic salaries? Hint: Please sort the data and check for that.

# Using the top 4 salaries (91489, 96156, 96744, 103041) identified among the outliers above, I can check the sex of each of the 4 top-paid faculty members.

# Salaries are numeric, so compare against numbers rather than strings
str(fac_sal$female[fac_sal$ay_salary == 103041])

##  chr "no"

str(fac_sal$female[fac_sal$ay_salary == 96744])

##  chr "no"

str(fac_sal$female[fac_sal$ay_salary == 96156])

##  chr "no"

str(fac_sal$female[fac_sal$ay_salary == 91489])

##  chr "no"

# As shown above, none of the 4 top-paid faculty members is female, which means that all 4 faculty members with the highest academic salaries are MALE.
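The hint suggests sorting the data, which avoids looking up each salary one at a time. A minimal sketch of that approach, assuming a small made-up data frame `toy` that mimics the two relevant columns of fac_sal.csv (the values are hypothetical):

```r
# Hypothetical stand-in for fac_sal with the same column names
toy <- data.frame(
  ay_salary = c(52000, 91000, 47000, 88000, 99000, 61000),
  female    = c("yes", "no", "no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Sort rows by salary, highest first, and keep the top 4
top4 <- head(toy[order(toy$ay_salary, decreasing = TRUE), ], 4)
top4

# Count the sexes among the top 4
table(top4$female)
```

Applied to the real data, `toy` would simply be replaced by `fac_sal`.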

Compare the salaries of males and females by plotting boxplots by sex side-by-side. Does there appear to be a difference between males and females?

Male <- fac_sal$ay_salary[fac_sal$female == "no"]

Female <- fac_sal$ay_salary[fac_sal$female == "yes"]

boxplot(Male, Female, col=c("blue","pink"), 
     main = "Comparison of Male and Female Faculty Salaries", names=c("Male","Female"),  ylab = "$")

Yes, there appears to be a difference between the salaries of male faculty and female faculty.
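A visual comparison like this can be backed up with a numeric summary per group. The sketch below uses `tapply()` on a made-up data frame `toy` (hypothetical values, same column names as fac_sal.csv) to compute the median salary for each sex.

```r
# Hypothetical stand-in for fac_sal with the same column names
toy <- data.frame(
  ay_salary = c(52000, 91000, 47000, 88000, 99000, 61000, 45000),
  female    = c("yes", "no", "no", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Median salary within each level of female
medians <- tapply(toy$ay_salary, toy$female, median)
medians
#    no   yes
# 76000 52000
```

With the real data, `tapply(fac_sal$ay_salary, fac_sal$female, median)` would give the two medians that the boxplots draw as center lines, and `boxplot(ay_salary ~ female, data = fac_sal)` is an alternative way to produce the side-by-side plot without splitting the vector first.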

Wilson Ng

10/24/2020
