Create a vector named ‘z’ with 500 z values
z = rnorm(500)
Create a histogram of z
hist(z)
Print the mean and standard deviation of z on the console
info = summary(z)
print(paste("The mean of the zs is", info["Mean"], "and the standard deviation is", sd(z)))
## [1] "The mean of the zs is -0.01289 and the standard deviation is 0.986159994866081"
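Depending on the R version, the mean pulled out of summary() may already be rounded for display. A minimal alternative sketch, assuming an arbitrary seed of 1 (not part of the original run) so the draw is repeatable, calls mean() and sd() directly and controls the rounding with sprintf():

```r
# Assumption: the seed value is arbitrary; it only makes the draw repeatable.
set.seed(1)
z <- rnorm(500)    # 500 draws from the standard normal distribution

z_mean <- mean(z)  # full-precision sample mean
z_sd   <- sd(z)    # sample standard deviation

# Format both statistics to four decimal places in one message.
msg <- sprintf("The mean of the zs is %.4f and the standard deviation is %.4f",
               z_mean, z_sd)
print(msg)
```

Because the values are random, the printed numbers will differ from the output above unless the same seed and R version are used.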
Create a logical vector that is TRUE for z values that are greater than 2 and FALSE otherwise.
z_logical = z > 2
Print the frequencies of TRUE and FALSE values.
prop.table(table(z_logical))
## z_logical
## FALSE  TRUE
## 0.974 0.026
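As a rough check on these frequencies, the sketch below shows raw counts next to proportions and compares the observed share of values above 2 with the theoretical tail probability of a standard normal. The seed is an arbitrary assumption, so the empirical value will not match the output above exactly.

```r
set.seed(1)                     # arbitrary seed (assumption), for repeatability
z <- rnorm(500)
z_logical <- z > 2

counts <- table(z_logical)      # raw counts of FALSE and TRUE
props  <- prop.table(counts)    # the same counts expressed as proportions

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values.
p_empirical <- mean(z_logical)

# Theoretical P(Z > 2) for a standard normal, about 0.0228.
p_theory <- pnorm(2, lower.tail = FALSE)

print(c(empirical = p_empirical, theoretical = p_theory))
```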
Run the following 8 lines of code to create 8 vector objects. Print each vector, and make sure you understand how each assignment works.
student.name <- letters[1:20]
gender <- c('m','f','f','m','f','m','m','m','f','f','m','f','m','m','f','f','f','f','m','m')
homework1 <- c(rep(7,5),rep(8,5),rep(9,5),rep(10,5))
homework2 <- c(9.5,rep(10,19))
homework3 <- c(rep(10,10),rep(8,3),rep(8,6),0)
homework4 <- c(8,9,9,9,8,10,10,10,10,7,9,9,8,9,5,rep(10,5))
homework5 <- c(rep(5,5),10,rep(9,14))
project <- c(20,30,30,30,0,30,25,25,30,30,30,30,15,25,25,25,20,30,28,20)
Create a data frame named class.grades from 7 of the objects created above: gender, homework1, ..., homework5, project (all except student.name)
class.grades = data.frame(gender,homework1,homework2,homework3,homework4,homework5, project)
Have R print the dimensions of class.grades (how many rows and columns).
dimensions = dim(class.grades)
print(paste(dimensions[1], "rows and", dimensions[2], "columns"))
## [1] "20 rows and 7 columns"
Label the rows with the vector student.name
(i.e. each row is given a single-letter name)
rownames(class.grades) = student.name
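Once rows are labelled, they can be selected by name as well as by position. The sketch below uses a hypothetical three-student data frame (the names and scores are made up for illustration) and sets the labels through the row.names argument of data.frame(), which is one alternative to assigning rownames() afterwards.

```r
# Hypothetical three-student example; values are made up for illustration.
scores <- data.frame(homework1 = c(7, 8, 9),
                     project   = c(20, 30, 25),
                     row.names = c("a", "b", "c"))

print(scores["b", ])                  # the whole row for student b

# A single cell, selected by row name and column name.
b_project <- scores["b", "project"]

print(rownames(scores))               # the row labels themselves
```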
Print just the first few lines of class.grades.
head(class.grades)
##   gender homework1 homework2 homework3 homework4 homework5 project
## a      m         7       9.5        10         8         5      20
## b      f         7      10.0        10         9         5      30
## c      f         7      10.0        10         9         5      30
## d      m         7      10.0        10         9         5      30
## e      f         7      10.0        10         8         5       0
## f      m         8      10.0        10        10        10      30
Print the entire class.grades data frame.
class.grades
##   gender homework1 homework2 homework3 homework4 homework5 project
## a      m         7       9.5        10         8         5      20
## b      f         7      10.0        10         9         5      30
## c      f         7      10.0        10         9         5      30
## d      m         7      10.0        10         9         5      30
## e      f         7      10.0        10         8         5       0
## f      m         8      10.0        10        10        10      30
## g      m         8      10.0        10        10         9      25
## h      m         8      10.0        10        10         9      25
## i      f         8      10.0        10        10         9      30
## j      f         8      10.0        10         7         9      30
## k      m         9      10.0         8         9         9      30
## l      f         9      10.0         8         9         9      30
## m      m         9      10.0         8         8         9      15
## n      m         9      10.0         8         9         9      25
## o      f         9      10.0         8         5         9      25
## p      f        10      10.0         8        10         9      25
## q      f        10      10.0         8        10         9      20
## r      f        10      10.0         8        10         9      30
## s      m        10      10.0         8        10         9      28
## t      m        10      10.0         0        10         9      20
Print summary statistics on class.grades (using one command)
summary(class.grades)
##  gender   homework1       homework2        homework3      homework4
##  f:10   Min.   : 7.00   Min.   : 9.500   Min.   : 0.0   Min.   : 5.00
##  m:10   1st Qu.: 7.75   1st Qu.:10.000   1st Qu.: 8.0   1st Qu.: 8.75
##         Median : 8.50   Median :10.000   Median : 9.0   Median : 9.00
##         Mean   : 8.50   Mean   : 9.975   Mean   : 8.6   Mean   : 9.00
##         3rd Qu.: 9.25   3rd Qu.:10.000   3rd Qu.:10.0   3rd Qu.:10.00
##         Max.   :10.00   Max.   :10.000   Max.   :10.0   Max.   :10.00
##    homework5        project
##  Min.   : 5.00   Min.   : 0.00
##  1st Qu.: 8.00   1st Qu.:23.75
##  Median : 9.00   Median :26.50
##  Mean   : 8.05   Mean   :24.90
##  3rd Qu.: 9.00   3rd Qu.:30.00
##  Max.   :10.00   Max.   :30.00
Use the following call to the boxplot() function to see a distribution of homework grades
boxplot(class.grades$homework1, class.grades$homework2,
        class.grades$homework3, class.grades$homework4, class.grades$homework5,
        names = c('#1','#2','#3','#4','#5'),
        main = "Homework Grades", cex.sub = .7,
        sub = "Scores range from 0 (lowest possible grade) to 10 (highest possible grade)")
From the plot, which homework appears to be the easiest?:
Homework #2 appears to have been the easiest: its lowest grade (9.5) is higher than the median grade of every other assignment.
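The reading of the plot can be checked numerically. This sketch rebuilds only the five homework vectors defined earlier and compares the lowest Homework #2 grade with the medians of the other assignments; the helper names lowest, medians, and hw2_beats_all are illustrative.

```r
# The five homework vectors, as defined earlier in this report.
homework <- data.frame(
  homework1 = c(rep(7, 5), rep(8, 5), rep(9, 5), rep(10, 5)),
  homework2 = c(9.5, rep(10, 19)),
  homework3 = c(rep(10, 10), rep(8, 3), rep(8, 6), 0),
  homework4 = c(8, 9, 9, 9, 8, 10, 10, 10, 10, 7, 9, 9, 8, 9, 5, rep(10, 5)),
  homework5 = c(rep(5, 5), 10, rep(9, 14))
)

lowest  <- sapply(homework, min)      # lowest grade per assignment
medians <- sapply(homework, median)   # median grade per assignment

# Is the lowest Homework #2 grade above the median of every other assignment?
others <- names(medians) != "homework2"
hw2_beats_all <- all(lowest["homework2"] > medians[others])
print(hw2_beats_all)
```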
Task 3: The following questions deal with the ‘esoph’ dataset from the R package named ‘datasets’. NOTE: This package ships with base R, so everyone should already have it installed.
Load the ‘datasets’ library and select the ‘esoph’ dataset using the following two commands:
library(datasets)
attach(esoph)
Run help(esoph) to see a description of the data set.
Have R print the dimensions of the esoph dataset (how many rows and columns). HINT: ‘esoph’ is a data frame object, so you can subset it using brackets [] or $ and manipulate it as you would the class.grades object you created in Task 2.
dimensions = dim(esoph)
print(paste(dimensions[1], "rows and", dimensions[2], "columns"))
## [1] "88 rows and 5 columns"
Have R list the column names in the esoph dataset.
colnames(esoph)
## [1] "agegp" "alcgp" "tobgp" "ncases" "ncontrols"
Have R show the first 6 rows of the entire dataset.
head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7
Have R show the first 6 values of the vector of ncontrols.
head(ncontrols)
## [1] 40 10 6 5 27 7
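Calling head(ncontrols) by bare name only works because esoph was attached. As a sketch of the same lookup without attach(), the column can be reached through the data frame with $ or with(); the variable names first_controls and total_cases are illustrative.

```r
library(datasets)

# Refer to the column through the data frame instead of attaching it.
first_controls <- head(esoph$ncontrols)
print(first_controls)

# with() evaluates an expression using the data frame's columns by name.
total_cases <- with(esoph, sum(ncases))
print(total_cases)
```

Both forms make it explicit which data frame a column comes from, which helps when several data frames share column names.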
Have R show how many unique values of agegp there are.
Repeat with tobgp and alcgp.
toSee = c("agegp", "tobgp", "alcgp")
for (item in toSee) {
  uniq = unique(esoph[[item]])
  print(paste("There are", length(uniq), "unique items in", item))
}
## [1] "There are 6 unique items in agegp"
## [1] "There are 4 unique items in tobgp"
## [1] "There are 4 unique items in alcgp"
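The same counts can be produced without an explicit loop. A minimal sketch, assuming the same three column names, applies one function to each column with sapply() and returns a named vector:

```r
library(datasets)

to_see <- c("agegp", "tobgp", "alcgp")

# Apply the same counting function to each selected column.
unique_counts <- sapply(esoph[to_see], function(col) length(unique(col)))
print(unique_counts)
```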
Have R calculate a ratio of ncases/ncontrols.
case_to_control = ncases/ncontrols
Have R draw a histogram of the ratios.
hist(case_to_control)
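A variation on the two steps above, sketched without relying on attach(): the ratio is stored as a column on a copy of the data frame, and the histogram gets a title and axis label. The names esoph_ratio and finite_ratio are illustrative, and the filter for finite values is a precaution in case any ncontrols value is zero.

```r
library(datasets)

esoph_ratio <- esoph   # work on a copy; the name is illustrative
esoph_ratio$ratio <- esoph_ratio$ncases / esoph_ratio$ncontrols

# Division by zero would give Inf or NaN, so keep only finite ratios for plotting.
finite_ratio <- esoph_ratio$ratio[is.finite(esoph_ratio$ratio)]

# hist() returns the bin information invisibly, so it can be inspected later.
h <- hist(finite_ratio,
          main = "Ratio of cases to controls",
          xlab = "ncases / ncontrols")
```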
The symbol ~ represents the concept ‘explained by’ in R.
For example, to do a boxplot of heights for males and females separately, you could do a boxplot of height ~ gender, that is, height ‘explained by’ gender.
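A small sketch of the height ~ gender idea, using hypothetical data (the six heights below are made up for illustration):

```r
# Hypothetical data: heights in centimetres, made up for illustration.
people <- data.frame(
  height = c(178, 165, 160, 182, 170, 175),
  gender = c("m", "f", "f", "m", "f", "m")
)

# One box per gender: height 'explained by' gender.
bp <- boxplot(height ~ gender, data = people,
              main = "Height by gender", ylab = "Height (cm)")

print(bp$names)   # the group labels, in plotting order
```

The variable on the left of ~ is the one being plotted, and the variable on the right defines the groups.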
Using esoph, have R draw a boxplot of ratio by agegp. Include a title on your boxplot.
Next, have R draw a boxplot of ratio by alcgp. Include a title on your boxplot.
Finally, have R draw a boxplot of ratio by tobgp. Include a title on your boxplot.
par(mfrow = c(1, 3)) # Place the plots side by side to quickly assess differences
boxplot(case_to_control ~ agegp, main = "Case to Control by Age")
boxplot(case_to_control ~ alcgp, main = "Case to Control by Alcohol Consumption")
boxplot(case_to_control ~ tobgp, main = "Case to Control by Tobacco Consumption")
State in a comment: Based on the boxplots, which factors (age, alcohol consumption, tobacco) seem to have the most impact on esophageal cancer? Explain briefly:
I would say that alcohol consumption has the biggest impact. The ratio climbs faster with each alcohol group, whereas for age it jumps sharply at 45-54 and then stays relatively steady.
Alcohol consumption is also the only factor whose median increases at every successive group.
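The medians read off the boxplots can also be tabulated. This is a sketch, not part of the original answer: it recomputes the ratio on a copy of the data frame (esoph_ratio is an illustrative name) and uses aggregate() with the same formula notation to get one median per group.

```r
library(datasets)

esoph_ratio <- esoph   # work on a copy; the name is illustrative
esoph_ratio$ratio <- esoph_ratio$ncases / esoph_ratio$ncontrols

# Median ratio 'explained by' each factor, one row per group.
med_age <- aggregate(ratio ~ agegp, data = esoph_ratio, FUN = median)
med_alc <- aggregate(ratio ~ alcgp, data = esoph_ratio, FUN = median)
med_tob <- aggregate(ratio ~ tobgp, data = esoph_ratio, FUN = median)

print(med_age)
print(med_alc)
print(med_tob)
```

Scanning the ratio column of each table shows whether the medians rise steadily from one group to the next.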