Task 1

Create a vector named ‘z’ with 500 z values

z = rnorm(500)

Create a histogram of z

hist(z)

Print the mean and standard deviation of z on the console

info = summary(z)
print(paste("The mean of the zs is", info["Mean"], "and the standard deviation is", sd(z)))
## [1] "The mean of the zs is -0.01289 and the standard deviation is 0.986159994866081"
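
The mean above shows fewer digits than the standard deviation because, depending on the R version, the values stored by summary() may already be rounded to four significant digits. A minimal sketch, assuming an arbitrary seed (not part of the original assignment) so the draw is reproducible, computes both statistics directly and rounds them the same way:

```r
# Assumption: the seed value is arbitrary; it only makes the draw reproducible.
set.seed(1)
z <- rnorm(500)

# mean() and sd() return full-precision values; round() controls the display.
z_mean <- mean(z)
z_sd <- sd(z)
print(paste("The mean of the zs is", round(z_mean, 3),
            "and the standard deviation is", round(z_sd, 3)))
```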

Create a logical vector that is TRUE for z values greater than 2 and FALSE otherwise.

z_logical = z > 2

Print the frequencies of TRUE and FALSE values.

prop.table(table(z_logical))
## z_logical
## FALSE  TRUE 
## 0.974 0.026
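
The output above reports relative frequencies (proportions). If absolute counts are wanted instead, table() alone gives them. The following is a sketch under the assumption of an arbitrary seed; the factor() call is an added safeguard so that both levels appear even if no value exceeds 2:

```r
# Assumption: arbitrary seed, used only for reproducibility.
set.seed(1)
z <- rnorm(500)
z_logical <- z > 2

# Fix the levels so both FALSE and TRUE are always tabulated.
counts <- table(factor(z_logical, levels = c(FALSE, TRUE)))  # absolute frequencies
props <- prop.table(counts)                                  # relative frequencies
print(counts)
print(props)

# sum() of a logical vector counts its TRUE values.
n_true <- sum(z_logical)
```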

Task 2

Run the following 8 lines of code to create 8 vector objects. Print each vector, and make sure you understand how each assignment works.

student.name <- letters[1:20]
gender <- c('m','f','f','m','f','m','m','m','f','f','m','f','m','m','f','f','f','f','m','m')
homework1 <- c(rep(7,5),rep(8,5),rep(9,5),rep(10,5))
homework2 <- c(9.5,rep(10,19))
homework3 <- c(rep(10,10),rep(8,3),rep(8,6),0)
homework4 <- c(8,9,9,9,8,10,10,10,10,7,9,9,8,9,5,rep(10,5))
homework5 <- c(rep(5,5),10,rep(9,14))
project <- c(20,30,30,30,0,30,25,25,30,30,30,30,15,25,25,25,20,30,28,20)

Create a data frame named class.grades from 7 of the objects created above: gender, homework1, ..., homework5, project (all except student.name)

class.grades = data.frame(gender,homework1,homework2,homework3,homework4,homework5, project)

Have R print the dimensions of class.grades (how many rows and columns).

dimensions = dim(class.grades)
print(paste(dimensions[1], "rows and", dimensions[2], "columns"))
## [1] "20 rows and 7 columns"

Label the rows with the vector student.name

(i.e. each row is given a single-letter name)

rownames(class.grades) = student.name

Print just the first few lines of class.grades.

head(class.grades)
##   gender homework1 homework2 homework3 homework4 homework5 project
## a      m         7       9.5        10         8         5      20
## b      f         7      10.0        10         9         5      30
## c      f         7      10.0        10         9         5      30
## d      m         7      10.0        10         9         5      30
## e      f         7      10.0        10         8         5       0
## f      m         8      10.0        10        10        10      30

Print the entire class.grades data frame.

class.grades
##   gender homework1 homework2 homework3 homework4 homework5 project
## a      m         7       9.5        10         8         5      20
## b      f         7      10.0        10         9         5      30
## c      f         7      10.0        10         9         5      30
## d      m         7      10.0        10         9         5      30
## e      f         7      10.0        10         8         5       0
## f      m         8      10.0        10        10        10      30
## g      m         8      10.0        10        10         9      25
## h      m         8      10.0        10        10         9      25
## i      f         8      10.0        10        10         9      30
## j      f         8      10.0        10         7         9      30
## k      m         9      10.0         8         9         9      30
## l      f         9      10.0         8         9         9      30
## m      m         9      10.0         8         8         9      15
## n      m         9      10.0         8         9         9      25
## o      f         9      10.0         8         5         9      25
## p      f        10      10.0         8        10         9      25
## q      f        10      10.0         8        10         9      20
## r      f        10      10.0         8        10         9      30
## s      m        10      10.0         8        10         9      28
## t      m        10      10.0         0        10         9      20

Print summary statistics on class.grades (using one command)

summary(class.grades)
##  gender   homework1       homework2        homework3      homework4    
##  f:10   Min.   : 7.00   Min.   : 9.500   Min.   : 0.0   Min.   : 5.00  
##  m:10   1st Qu.: 7.75   1st Qu.:10.000   1st Qu.: 8.0   1st Qu.: 8.75  
##         Median : 8.50   Median :10.000   Median : 9.0   Median : 9.00  
##         Mean   : 8.50   Mean   : 9.975   Mean   : 8.6   Mean   : 9.00  
##         3rd Qu.: 9.25   3rd Qu.:10.000   3rd Qu.:10.0   3rd Qu.:10.00  
##         Max.   :10.00   Max.   :10.000   Max.   :10.0   Max.   :10.00  
##    homework5        project     
##  Min.   : 5.00   Min.   : 0.00  
##  1st Qu.: 8.00   1st Qu.:23.75  
##  Median : 9.00   Median :26.50  
##  Mean   : 8.05   Mean   :24.90  
##  3rd Qu.: 9.00   3rd Qu.:30.00  
##  Max.   :10.00   Max.   :30.00

Use the following call to the boxplot() function to see a distribution of homework grades

boxplot(class.grades$homework1,class.grades$homework2,
        class.grades$homework3,class.grades$homework4,class.grades$homework5, 
        names=c('#1','#2','#3','#4','#5'),  
        main="Homework Grades",cex.sub=.7,
        sub="Scores range from 0 (lowest possible grade) to 10 (highest possible grade)")

From the plot, which homework appears to be the easiest?

Homework #2 appears to have been the easiest: its lowest grade (9.5) is higher than the median grade of every other assignment.
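
The reading of the boxplot can be checked numerically. The sketch below rebuilds the five homework vectors from the assignment and compares the minimum of homework2 with the medians of the other assignments; the names homework, hw_min, hw_median and claim_holds are introduced here for illustration only.

```r
# Homework vectors copied from the assignment code in Task 2.
homework1 <- c(rep(7,5),rep(8,5),rep(9,5),rep(10,5))
homework2 <- c(9.5,rep(10,19))
homework3 <- c(rep(10,10),rep(8,3),rep(8,6),0)
homework4 <- c(8,9,9,9,8,10,10,10,10,7,9,9,8,9,5,rep(10,5))
homework5 <- c(rep(5,5),10,rep(9,14))
homework <- data.frame(homework1, homework2, homework3, homework4, homework5)

# Minimum and median of each assignment, one column per homework.
hw_min <- sapply(homework, min)
hw_median <- sapply(homework, median)
print(rbind(min = hw_min, median = hw_median))

# TRUE if homework2's lowest score exceeds every other assignment's median.
others <- names(hw_median) != "homework2"
claim_holds <- all(hw_min["homework2"] > hw_median[others])
print(claim_holds)
```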


Task 3

The following questions deal with the ‘esoph’ dataset from the R package named ‘datasets’. NOTE: This package ships with base R, so everyone should already have it installed.

Load the ‘datasets’ library and select the ‘esoph’ dataset using the following two commands:

library(datasets)
attach(esoph)

Run help(esoph) to see a description of the data set.

Have R print the dimensions of the esoph dataset (how many rows and columns). HINT: ‘esoph’ is a data frame object, so you can subset it using brackets [] or $ and manipulate it as you would the class.grades object you created in Task 2.

dimensions = dim(esoph)
print(paste(dimensions[1], "rows and", dimensions[2], "columns"))
## [1] "88 rows and 5 columns"

Have R list the column names in the esoph dataset.

colnames(esoph)
## [1] "agegp"     "alcgp"     "tobgp"     "ncases"    "ncontrols"

Have R show the first 6 rows of the entire dataset.

head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7

Have R show the first 6 values of the vector of ncontrols.

head(ncontrols)
## [1] 40 10  6  5 27  7

Have R show how many unique values of agegp there are.
Repeat with tobgp and alcgp.

toSee = c("agegp", "tobgp", "alcgp")
for(item in toSee){
  uniq = unique(esoph[[item]])
  print(paste("There are", length(uniq), "unique items in", item ))
}
## [1] "There are 6 unique items in agegp"
## [1] "There are 4 unique items in tobgp"
## [1] "There are 4 unique items in alcgp"
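
The same counts can be obtained without an explicit loop. This is one possible alternative, not the approach required by the assignment; the names cols and n_unique are illustrative:

```r
library(datasets)

# Apply the same counting function to each selected column of esoph.
cols <- c("agegp", "tobgp", "alcgp")
n_unique <- sapply(esoph[cols], function(x) length(unique(x)))
print(n_unique)
```

Because sapply() returns a named vector, the result can be indexed by column name, for example n_unique[["agegp"]].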

Have R calculate a ratio of ncases/ncontrols.

case_to_control = ncases/ncontrols
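
The line above works only because attach(esoph) made the columns visible by name. A sketch that does not depend on attach() uses with(); the check for non-finite values is an added precaution, on the assumption that some rows may have a control count of zero in some versions of the dataset:

```r
library(datasets)

# with() evaluates the expression using esoph's columns, without attach().
case_to_control <- with(esoph, ncases / ncontrols)

# Dividing by a zero control count yields Inf or NaN; count such values.
n_not_finite <- sum(!is.finite(case_to_control))
print(n_not_finite)
print(summary(case_to_control))
```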

Have R draw a histogram of the ratios.

hist(case_to_control)

The symbol ~ represents the concept ‘explained by’ in R.
For example, to do a boxplot of heights for males and females separately, you could do a boxplot of height ~ gender, that is, height ‘explained by’ gender.
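
As a small illustration of the formula interface, the sketch below uses a hypothetical data frame (the heights and gender labels are invented for this example and come from no real dataset):

```r
# Hypothetical data: heights in cm and gender labels invented for illustration.
people <- data.frame(
  height = c(178, 165, 160, 182, 170, 175, 158, 168),
  gender = factor(c("m", "f", "f", "m", "f", "m", "f", "m"))
)

# height ~ gender draws one box per level of gender.
bp <- boxplot(height ~ gender, data = people, main = "Height by Gender")

# boxplot() invisibly returns its statistics; names holds the group labels.
print(bp$names)
```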

Using esoph, have R draw a boxplot of ratio by agegp. Include a title on your boxplot.

Next, have R draw a boxplot of ratio by alcgp. Include a title on your boxplot.

Finally, have R draw a boxplot of ratio by tobgp. Include a title on your boxplot.

par(mfrow=c(1,3)) #Place the plots side by side to quickly assess differences
boxplot(case_to_control ~ agegp,main="Case to Control by Age")

boxplot(case_to_control ~ alcgp,main="Case to Control by Alcohol Consumption")

boxplot(case_to_control ~ tobgp,main="Case to Control by Tobacco Consumption")

State in a comment: Based on the boxplots, which factors (age, alcohol consumption, tobacco) seem to have the most impact on esophageal cancer? Explain briefly:

I would say that the biggest impact comes from alcohol consumption. Its effect grows more steeply from group to group than that of age, which shows a rapid jump at 45-54 but then remains relatively steady.

Also, alcohol consumption is the only factor whose median increases at every successive group.
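
The claim about the medians can be checked numerically rather than read off the plots. The following is a sketch under one assumption that is not part of the assignment: non-finite ratios (rows with zero controls, if any) are dropped before the medians are computed. The result may differ between R versions, since the dataset has been revised.

```r
library(datasets)

ratio <- with(esoph, ncases / ncontrols)

# Assumption: drop non-finite ratios (zero controls) before summarising.
ok <- is.finite(ratio)

# Median ratio within each level of each factor.
med_by_alc <- tapply(ratio[ok], esoph$alcgp[ok], median)
med_by_age <- tapply(ratio[ok], esoph$agegp[ok], median)
med_by_tob <- tapply(ratio[ok], esoph$tobgp[ok], median)
print(med_by_alc)
print(med_by_age)
print(med_by_tob)

# diff() > 0 at every step means the medians rise at each successive level.
print(all(diff(med_by_alc) > 0))
```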