OCS 283 HW 1

Calculations

Table 1 displays the forced expiratory volumes in 1 second (FEV1) for 13 adolescents suffering from asthma. By hand, calculate the mean, median, variance, and standard deviation.

Mean = (2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38)/13 = 38.35/13 = 2.95 L
Median: with n = 13 values, the median is the value in position (n+1)/2 = 14/2 = 7 of the sorted data.

2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05

The 7th value is 2.82, so the median is 2.82 L.

Sample Variance = [(2.30-2.95)^2 + (2.15-2.95)^2 + (3.50-2.95)^2 + (2.60-2.95)^2 + (2.75-2.95)^2 + (2.82-2.95)^2 + (4.05-2.95)^2 + (2.25-2.95)^2 + (2.68-2.95)^2 + (3.00-2.95)^2 + (4.02-2.95)^2 + (2.85-2.95)^2 + (3.38-2.95)^2]/(13-1) = 4.6596/12 = 0.39 L^2
Standard Deviation = sqrt(0.39 L^2) = 0.62 L
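
The hand calculations above can be cross-checked in R. This is only a quick sketch: the vector name fev and the other object names are chosen here for illustration and do not come from the assignment.

```r
# FEV1 values (litres) copied from the hand calculation above
fev <- c(2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
         2.25, 2.68, 3.00, 4.02, 2.85, 3.38)

n <- length(fev)                              # 13 observations
fev_mean   <- sum(fev) / n                    # arithmetic mean
fev_median <- sort(fev)[(n + 1) / 2]          # middle value; valid because n is odd
fev_var    <- sum((fev - fev_mean)^2) / (n - 1)  # sample variance (divides by n - 1)
fev_sd     <- sqrt(fev_var)                   # sample standard deviation

# Round to two decimals to compare with the hand-calculated answers
round(c(mean = fev_mean, median = fev_median,
        variance = fev_var, sd = fev_sd), 2)
```

The built-in functions mean(), median(), var(), and sd() should return the same values, since var() and sd() also divide by n - 1.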

The declared concentrations of tar for 35 brands of Canadian cigarettes are saved under the variable name tar in the data set lab2.csv (it is posted on the course website). Open this data set in R, and use it to answer the following questions. Use the R commands from last week's Introduction to R to answer parts a-c.

lab2 = read.csv("lab2.csv")

Find the mean and standard deviation of the concentrations of tar. Note: Don’t forget the units (check the R commands: summary, mean, and sd).

mean(lab2$tar)

## [1] 11.50286

sd(lab2$tar)

## [1] 5.299416

Find the median, range, and interquartile range of tar.

median(lab2$tar)

## [1] 13

max(lab2$tar)-min(lab2$tar)

## [1] 18.3

IQR(lab2$tar)

## [1] 6.5

Using R, produce a histogram of the tar measurements. Remember to use the frequency option. Describe the shape of the values.

hist(lab2$tar, breaks = 6, col = "black", main = "Tar Measurements", xlab = "Values", ylab = "Frequency")

The distribution is skewed to the left, which is consistent with the mean (11.5) being below the median (13).

Which number (mean or median) do you think provides the best measure of central tendency? Why?

The median is most likely the better measure of central tendency here. Because the distribution is skewed, the mean is pulled toward the extreme values in the tail, while the median is much less affected by unusually large or small observations.
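
As a small illustration of why the median resists extreme values, the sketch below compares the mean and median of two toy vectors. The numbers are hypothetical and are not taken from lab2.csv.

```r
# Hypothetical values for illustration only (not from lab2.csv)
x     <- c(10, 11, 12, 13, 14)
x_out <- c(1, 11, 12, 13, 14)   # same data, smallest value replaced by a low outlier

mean(x)        # 12
median(x)      # 12

mean(x_out)    # 10.2: the mean is pulled toward the outlier
median(x_out)  # 12: the median is unchanged
```

A single low value drags the mean down, which mirrors the left-skewed tar data where the mean sits below the median.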

For this question, the dataset we will use is fac_sal.csv (posted on the course website). This dataset contains information about faculty salaries from Bowling Green State University. The data are from Regression with Social Data, A. DeMaris, John Wiley and Sons, 2004. Variables of interest for this assignment include:

• ay_salary = Academic salary (per year)
• yrs_emp = Number of years of employment with the college
• female = 1 if female, 0 if male
• fac_rank = Indicates whether the faculty member is an assistant professor, an associate professor, a full professor, or an instructor

fac_sal <- read.csv("fac_sal.csv")

a) What does the distribution of academic year salaries look like? Is the distribution symmetric? Right-skewed? Left-skewed? Hint: Please describe the data graphically through a histogram.

hist(fac_sal$ay_salary, breaks = 28, col = "blue", main = "Academic Year Salaries", xlab = "$", ylab = "Frequency")

The distribution is asymmetric and right-skewed.

b) How does the boxplot reflect what you noticed in the histogram? Are there faculty members who have particularly high salaries? Hint: Please describe the data graphically through a boxplot.

boxplot(fac_sal$ay_salary, col = "blue", main = "Academic Year Salaries", xlab = "Faculty", ylab = "$")

The boxplot confirms the right skew seen in the histogram. A few faculty members have particularly high salaries.

c) Based on parts (a) and (b), would you expect the mean or median academic salary to be larger? Which measure of central tendency do you think is better in this case? Hint: Please confirm your answer with a descriptive summary, using the summary, mean, sd, and median commands.

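Because the distribution is right-skewed with several very high salaries, I would expect the mean academic salary to be larger than the median. The median is the better measure of central tendency in this case because it is less affected by those high outliers.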

summary(fac_sal$ay_salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23800   36833   46728   47801   57573  103041

The summary confirms this: the mean (47801) is larger than the median (46728).

d) How variable are the academic salaries? Hint: Please give three measures of variability: standard deviation, range, IQR, etc.

sd(fac_sal$ay_salary)

## [1] 13991.9

max(fac_sal$ay_salary) - min(fac_sal$ay_salary)

## [1] 79241

IQR(fac_sal$ay_salary)

## [1] 20740

These values show a large degree of variability among the academic salaries: the standard deviation is about 13992, the range is 79241, and the IQR is 20740 (all in dollars).

e) Are there any outliers in the academic salaries? Hint: Please give an explanation from the graphs and do the mathematical calculations.

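Yes, there are outliers in the academic salaries.

By the 1.5 x IQR rule, outliers are measurements greater than the 75th percentile plus 1.5 x IQR or less than the 25th percentile minus 1.5 x IQR. The lower cutoff, 36833 - 1.5 x 20740 = 5723, is below the minimum salary (23800), so only the upper cutoff needs to be checked.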

quantile(fac_sal$ay_salary, c(.75))

##   75% 
## 57573

57573 + 1.5*20740

## [1] 88683

fac_sal$ay_salary[fac_sal$ay_salary>88683]

## [1]  89789  90082  91405  91489  96156  96744 103041

The seven salaries above exceed the cutoff of 88683 and are therefore outliers.
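
The 1.5 x IQR rule has a cutoff on each side of the box. As a minimal sketch, both cutoffs can be computed from the quartiles that summary() reported earlier. The object names below are chosen here for illustration.

```r
# Quartiles and extremes copied from the summary() output above
q1      <- 36833
q3      <- 57573
sal_min <- 23800
sal_max <- 103041

iqr         <- q3 - q1           # 20740, matches the IQR() result above
lower_fence <- q1 - 1.5 * iqr    # 5723
upper_fence <- q3 + 1.5 * iqr    # 88683

sal_min < lower_fence   # FALSE: no salaries fall below the lower fence
sal_max > upper_fence   # TRUE: at least one salary lies above the upper fence
```

Since the minimum salary is well above the lower fence, all outliers in this data set are on the high side.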

f) What is the 20th percentile of the academic salaries? Hint: Use the quantile command.

quantile(fac_sal$ay_salary, 0.20)

##   20% 
## 34731

g) How many males are in the data set? How many females are in the data set? Hint: Please use the "table" command.

Females

length(fac_sal$female[fac_sal$female== "yes"])

## [1] 214

Males

length(fac_sal$female[fac_sal$female == "no"])

## [1] 511

table(fac_sal$female)

## 
##  no yes 
## 511 214

h) What is the sex of the four faculty members with the highest academic salaries? Hint: Please sort the data using the sort command and check for that.

str(fac_sal$female[fac_sal$ay_salary == 103041])

##  chr "no"

str(fac_sal$female[fac_sal$ay_salary == 96744])

##  chr "no"

str(fac_sal$female[fac_sal$ay_salary == 96156])

##  chr "no"

str(fac_sal$female[fac_sal$ay_salary == 91489])

##  chr "no"

The four highest salary earners are all male, as shown by the “no” value of female for each of them.
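
An alternative to looking up each salary one at a time is to sort the rows by salary and read off the top four. The sketch below uses a small made-up data frame (demo, with hypothetical values) that only borrows the column names ay_salary and female from fac_sal.csv.

```r
# Hypothetical mini data set; values are invented, column names mirror fac_sal.csv
demo <- data.frame(
  ay_salary = c(52000, 98000, 41000, 87000, 76000, 91000),
  female    = c("yes", "no", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# order() returns the row positions from highest to lowest salary;
# head(..., 4) keeps the four highest-paid rows
top4 <- head(demo[order(demo$ay_salary, decreasing = TRUE), ], 4)
top4$female
```

Applied to the real data, the same pattern with fac_sal in place of demo would list the sex of the four highest earners in one step.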

i) Compare the salaries of males and females by plotting boxplots by sex side-by-side. Does there appear to be a difference between males and females?

Male = fac_sal$ay_salary[fac_sal$female == "no"]
Female = fac_sal$ay_salary[fac_sal$female == "yes"]

boxplot(Male, Female, col = c("blue", "pink"), main = "Comparison of male and female salaries", names = c("Male", "Female"), ylab = "$")

In this data set, male salaries appear to be higher than female salaries.
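
The visual comparison can be backed up with group medians, and the side-by-side boxplot can also be drawn with R's formula interface instead of splitting the data into two vectors first. This is a sketch on a made-up data frame (demo, hypothetical values); only the column names ay_salary and female come from fac_sal.csv.

```r
# Hypothetical mini data set; values are invented, column names mirror fac_sal.csv
demo <- data.frame(
  ay_salary = c(52000, 98000, 41000, 87000, 76000, 91000),
  female    = c("yes", "no", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Median salary within each level of female ("no" = male, "yes" = female)
medians <- tapply(demo$ay_salary, demo$female, median)
medians

# Formula interface: one box per level of female, in alphabetical order ("no", "yes")
boxplot(ay_salary ~ female, data = demo,
        names = c("Male", "Female"), col = c("blue", "pink"),
        main = "Comparison of male and female salaries", ylab = "$")
```

Using the formula interface keeps the grouping variable attached to the data, so there is no need to create separate Male and Female vectors.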

Stephen Kasper

10/27/2020
