Data visualization is an essential first step in analysis. Visualizing data can help determine whether certain statistical tests are appropriate, or whether there are systematic issues in your data.
Plotting in R spans the range from very simple to very complex. There are options to add numerous arguments to the basic plot() function. Beyond that, there are additional functions that layer on further data or other information (labels, lines, legend, etc.) once a plot has been established.
Today, we’ll be working with a dataset that is designed to mimic data resulting from a clinical trial. There are a mix of categorical and numerical variables, which means that a variety of plots could be made to visualize the data.
Let’s read in the data and practice some new functions and plotting options. We’ll start with a simple dataset with three categorical variables. The variables indicate the sex of the patient, whether they received a drug or a placebo, and whether they got sick or did not get sick:
# Download the data
d1 <- read.csv("https://raw.githubusercontent.com/tgouhier/biostats/main/lab1_dataset1.csv")
# Recall that head() shows the first few lines of the dataset
head(d1)
## sex treatment outcome
## 1 male placebo sick
## 2 male placebo sick
## 3 male placebo sick
## 4 male placebo not sick
## 5 male placebo sick
## 6 male placebo sick
We need to convert these variables into counts in order to do any further analyses. The table() function will do that for you!
# table() on the full dataset will create a summary of all counts. Note that this does not create a new object, but simply performs the calculation!
table(d1)
## , , outcome = not sick
##
## treatment
## sex drug placebo
## female 36 35
## male 27 4
##
## , , outcome = sick
##
## treatment
## sex drug placebo
## female 4 5
## male 13 36
# You can also create a table of a subset of the data
# You can do operations within a function!
table(d1[, c("treatment", "outcome")])
## outcome
## treatment not sick sick
## drug 63 17
## placebo 39 41
Let’s read in the second dataset, which uses a numerical variable for treatment level instead of a categorical variable for drug vs. placebo.
# Download the next dataset
d2 <- read.csv("https://raw.githubusercontent.com/tgouhier/biostats/main/lab1_dataset2.csv")
# I'm going to add another column to the data to show you some examples of how to manipulate the dataset.
# The added column will be a random vector of ages. The next 2 lines of code add the "age" column:
## YOU DO NOT NEED TO USE THESE FUNCTIONS FOR THIS LAB
set.seed(2501)
d2$age <- round(runif(dim(d2)[1], 20, 80))
# Example of adding a new column with all 1s
d2$newcolumn <- 1
# Look at the content of this dataset
head(d2)
## sex treatment outcome side.effects age newcolumn
## 1 male 1 sick yes 47 1
## 2 male 1 sick no 29 1
## 3 male 1 sick yes 59 1
## 4 male 1 not sick no 61 1
## 5 male 1 sick no 57 1
## 6 male 1 sick no 77 1
Imagine you want to check your experimental design, and make sure that there are no systematic age differences among treatment levels. What steps would you need to implement this?
The data you want to end up with is the mean age for each treatment level. This sounds a lot like a problem that aggregate() could solve! We will first look at splitting age by treatment, and then by treatment and sex:
# Calculate the mean age by treatment level
mean.age.all <- aggregate(age ~ treatment, FUN = mean, data = d2)
print(mean.age.all)
## treatment age
## 1 1 49.9125
## 2 2 49.4625
## 3 3 48.6750
## 4 4 50.5000
## 5 5 50.3125
## 6 6 49.9750
## 7 7 49.8375
## 8 8 49.7125
## 9 9 50.3625
## 10 10 53.7375
# What if you wanted to split out mean age by combinations of treatment and sex?
mean.age.sex <- aggregate(age ~ treatment + sex, FUN = mean, data = d2)
print(mean.age.sex)
## treatment sex age
## 1 1 female 48.550
## 2 2 female 46.850
## 3 3 female 48.175
## 4 4 female 48.625
## 5 5 female 50.525
## 6 6 female 53.975
## 7 7 female 49.250
## 8 8 female 47.200
## 9 9 female 52.350
## 10 10 female 50.575
## 11 1 male 51.275
## 12 2 male 52.075
## 13 3 male 49.175
## 14 4 male 52.375
## 15 5 male 50.100
## 16 6 male 45.975
## 17 7 male 50.425
## 18 8 male 52.225
## 19 9 male 48.375
## 20 10 male 56.900
Ages look pretty consistent, but it would still be better to plot the data.
Think about what would make a good plot for this dataframe. It would be nice to see one line plot for the age of females across the 10 treatment levels, and one line plot in a different color for the age of males across the 10 treatment levels. To implement this, first, we need to split up the dataset into two dataframes, which correspond to females and males:
# Use logical evaluation to keep the rows that say "male"
ages.males <- mean.age.sex[mean.age.sex$sex == "male", ]
# Create a dataframe called "ages.female" that keeps observation for females
ages.females <- mean.age.sex[mean.age.sex$sex == "female", ]
Then, we can plot the two datasets:
# Time to plot. Either dataset can go first:
plot(ages.males$treatment, ages.males$age,
type = "l",
lwd = 1.5,
col = "purple",
xlab = "Treatment Level",
ylab = "Mean Age")
# Instead of using plot(), which creates a new figure, we will now use points(), which simply ADDS data to an existing figure
# You MUST run this whole code chunk, or highlight the whole section, or else you will get an error
points(ages.females$treatment, ages.females$age,
type = "l",
lwd = 1.5,
col = "blue")
# Finally, we need a legend to distinguish the two datasets
legend("topleft",
legend = c("Males", "Females"),
col = c("purple", "blue"),
lwd = 1.5)
There are a lot of new functions and arguments we just used, so let’s take a closer look. When using points(), you do NOT need to specify any arguments like axis labels. This is because the function simply ADDS on data to a plot that already exists. Another new argument within plot() was lwd, which specifies line width. We increased it a bit from its default of 1.
Then, we added a legend. The first argument for legend() is the position of the legend (options include “top”, “topleft”, “topright”, “bottom”, etc). The second argument is the labels for the legend, in vector form. The third argument is the colors for each of the labels. The last argument is lwd, which tells the function to both make the symbol a line and to make it a width of 1.5.
Here is a reminder of other commonly used plotting arguments. These will work within plot(), points(), and legend():
We will look at one more example before you try on your own.
Imagine that what you are really concerned about is the fraction of people over 65, rather than the mean age of each treatment level group. How can you look at what fraction of each group is over 65? Begin by counting the entries in a column or vector that are greater than 65.
Recall that logical operators such as <, <=, ==, >=, and > allow you to compare a variable to a specified value. For example:
# Check if the age column is over 65
d2$age > 65
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [25] FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
## [37] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
## [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [97] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [109] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [121] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [133] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [157] TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [169] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [193] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [205] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [217] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [277] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## [289] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [301] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [349] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [361] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [373] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
## [385] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
## [397] TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [409] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [421] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [433] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [445] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [457] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [469] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [481] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [493] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [505] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [517] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [529] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [541] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [553] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
## [565] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [577] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [589] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [601] TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [613] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
## [625] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [637] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [649] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## [661] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [673] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [685] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [697] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [709] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [721] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [733] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [745] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [757] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [769] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [781] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [793] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
# Count how many values in the age column are over 65
sum(d2$age > 65)
## [1] 197
# Count the number of values in the age column that are EXACTLY 65
sum(d2$age == 65)
## [1] 12
# Count the fraction of values in the age column that are over 65
sum(d2$age > 65) / length(d2$age)
## [1] 0.24625
# Calculate the fraction of patients who are male:
sum(d2$sex == "male") / length(d2$sex)
## [1] 0.5
We now want to check what fraction of entries in this column are over 65 for each combination of the treatment and sex variables. The first step is making a generic function to count the fraction of entries over 65 within a column or vector. Then, use aggregate() to apply this function to each variable combination:
# We want to input a vector and have the output be a proportion (number) of the entries that are greater than 65
prop65 <- function(vector){
prop.over <- sum(vector > 65) / length(vector)
return(prop.over)
}
# Now, we can use this function with aggregate() to apply this algorithm by treatment level and age:
prop.65.sex <- aggregate(age ~ treatment + sex, FUN = prop65, data = d2)
(prop.65.sex)
## treatment sex age
## 1 1 female 0.225
## 2 2 female 0.150
## 3 3 female 0.225
## 4 4 female 0.225
## 5 5 female 0.225
## 6 6 female 0.250
## 7 7 female 0.275
## 8 8 female 0.175
## 9 9 female 0.325
## 10 10 female 0.225
## 11 1 male 0.250
## 12 2 male 0.250
## 13 3 male 0.225
## 14 4 male 0.275
## 15 5 male 0.275
## 16 6 male 0.175
## 17 7 male 0.250
## 18 8 male 0.300
## 19 9 male 0.200
## 20 10 male 0.425
# Again, we now split the data into 2 dataframes. This time, we can use a different type of subsetting that is faster but less powerful. It keeps all rows where the entry for the "sex" variable matches a specified input, indicated by the double equals symbol (==) :
prop.65.males <- prop.65.sex[prop.65.sex$sex == "male", ]
prop.65.females <- prop.65.sex[prop.65.sex$sex == "female", ]
Take a close look at the custom function, since you will be asked to create something similar later. The code “sum(vector > 65)” counts how many entries in “vector” are over 65. The function then divides this sum by the total number of entries, calculated by “length(vector)” in order to arrive at the desired proportion.
Again, we can make the same kind of plot where we overlay the datasets for males and females. The final product should have treatment level on the x-axis and the proportion of people over 65 on the y-axis.
# Time to plot. Let's make a plot with both points and lines this time. Can you spot the change?
plot(ages.males$treatment, prop.65.males$age,
type = "b",
lwd = 1.5,
col = "purple",
xlab = "Treatment Level",
ylab = "Proportion > 65",
ylim = c(.15, .45))
# Instead of using plot(), which creates a new figure, we will now use points(), which simply ADDS data to an existing figure
# You MUST run this whole code chunk, or highlight the whole section, or else you will get an error
points(ages.females$treatment, prop.65.females$age,
type = "b",
lwd = 1.5,
col = "blue")
# Finally, we need a legend to distinguish the two datasets
legend("topleft",
legend = c("Males", "Females"),
col = c("purple", "blue"),
lwd = 1.5)
From this plot, one could say that the fraction of individuals over 65 is consistent for most groups with the exception of treatment level 10 for males. This might be concerning if you had a hypothesis about why old age might confound your study! Remember that these data are totally random – this is a good example of how even random data can show spurious signal if you look at enough groups!
Finally, it’s also important to note that points() will NOT adjust the area of the plot to make sure all the added data are shown. Thus, before creating the initial plot, you must determine the appropriate limits for your x and y axes using the xlim and ylim arguments, respectively. This has been done in the above figure.
You will investigate the dataset called “d1” that we looked at earlier. This dataset simulates the outcome of a clinical trial studying the effectiveness of an experimental drug treatment for the prevention of malaria. Read in the data again, in case you made any unintentional changes:
# Download the data
d1 <- read.csv("https://raw.githubusercontent.com/tgouhier/biostats/main/lab1_dataset1.csv")
To test the effectiveness of the drug, you have collected data on the occurrence of malaria in male and female test subjects treated with a placebo (sugar pill) vs. the drug. Inspect the dataset using the str() and head() functions. What are the names and the types (e.g., numerical, categorical) of the variables in the dataset?
str(d1)
## 'data.frame': 160 obs. of 3 variables:
## $ sex : chr "male" "male" "male" "male" ...
## $ treatment: chr "placebo" "placebo" "placebo" "placebo" ...
## $ outcome : chr "sick" "sick" "sick" "not sick" ...
head(d1)
## sex treatment outcome
## 1 male placebo sick
## 2 male placebo sick
## 3 male placebo sick
## 4 male placebo not sick
## 5 male placebo sick
## 6 male placebo sick
YOUR ANSWER:
sex would be a categorical nominal variable | treatment would be a categorical ordinal variable | outcome would be a categorical ordinal variable
Use the table() function to generate the same contingency table from the dataset that we saw above. Do males and females seem equally susceptible to malaria? Does the drug seem effective and if so, is its effectiveness contingent upon the sex of the subject?
table(d1)
## , , outcome = not sick
##
## treatment
## sex drug placebo
## female 36 35
## male 27 4
##
## , , outcome = sick
##
## treatment
## sex drug placebo
## female 4 5
## male 13 36
YOUR ANSWER:
According to the data, female patients seem to have relatively equivalent statistics between the two treatment types, with the drug and placebo having values within one patient of each other for the two outcomes, 36 (drug) and 35 (placebo) for not sick, 4 (drug) and 5 (placebo) for sick. Male patients seem to be better treated by the drug considering the majority of placebo patients (36) remained sick with a small fraction (4) being not sick at the end. While there was a larger amount of male drug patients that remained sick at the end of the study (13), the majority of male drug patients were not sick at the end of the study (27).
Determine the proportion of sick subjects for each combination of sex and treatment.
Step 1: You will need to create a custom function that calculates the proportion of outcomes that are “sick”. Modify the custom function we created in the introduction to perform this calculation. Instead of counting the number of entries that are over 65, you must count the number of entries that say “sick”. Remember to name your function something new when you modify it!
Step 2: Use your custom function within aggregate() to produce the desired proportions.
prop.sick <- function(x) {
y <- sum(x == "sick") / length(x)
return(y)
}
prop.sick.st <- aggregate(outcome ~ sex + treatment, FUN = prop.sick, data = d1)
head(prop.sick.st)
## sex treatment outcome
## 1 female drug 0.100
## 2 male drug 0.325
## 3 female placebo 0.125
## 4 male placebo 0.900
Is it easier to see patterns in the proportions or the contingency table?
As a doctor, would you recommend this drug to patients where malaria is common?
YOUR ANSWER:
Yes it is easier to see the patterns in the proportions. If I was a doctor I would recommend this drug to male patients if there is nothing better since it actually works well according to the data. But this data doesn’t seem great so I would likely not if there are any other options.
You decide to forge ahead and determine the optimal dosage of the drug. Instead of a binary variable (drug vs. placebo), there are now quantitative measurements of how much drug is given. Read in the dataset again:
# Download the data
d2 <- read.csv("https://raw.githubusercontent.com/tgouhier/biostats/main/lab1_dataset2.csv")
As before, use the str() and head() functions to inspect the dataset. What are the names and the types of the variables in the dataset?
head(d2)
## sex treatment outcome side.effects
## 1 male 1 sick yes
## 2 male 1 sick no
## 3 male 1 sick yes
## 4 male 1 not sick no
## 5 male 1 sick no
## 6 male 1 sick no
str(d2)
## 'data.frame': 800 obs. of 4 variables:
## $ sex : chr "male" "male" "male" "male" ...
## $ treatment : int 1 1 1 1 1 1 1 1 1 1 ...
## $ outcome : chr "sick" "sick" "sick" "not sick" ...
## $ side.effects: chr "yes" "no" "yes" "no" ...
YOUR ANSWER:
sex would be a categorical nominal variable | treatment would be a numerical discrete variable | outcome would be a | categorical ordinal variable | side.effects would be a categorical nominal variable
Use the aggregate() function to determine the proportion of sick male and female subjects in each dosage group.
Save your aggregated dataframe as a new object (with a new name) to use again later!
prop.sick.mf <- aggregate(outcome ~ sex + treatment, FUN = prop.sick, data = d2)
prop.sick.mf
## sex treatment outcome
## 1 female 1 0.350
## 2 male 1 0.900
## 3 female 2 0.150
## 4 male 2 0.800
## 5 female 3 0.200
## 6 male 3 0.875
## 7 female 4 0.275
## 8 male 4 0.900
## 9 female 5 0.175
## 10 male 5 0.525
## 11 female 6 0.250
## 12 male 6 0.275
## 13 female 7 0.200
## 14 male 7 0.225
## 15 female 8 0.150
## 16 male 8 0.225
## 17 female 9 0.150
## 18 male 9 0.375
## 19 female 10 0.225
## 20 male 10 0.275
Consider whether you can see patterns when there are this many categories. This is where plotting is essential!
Males seem to feel effects in the higher treatment levels, females remain constant.
Subset the aggregated data to two dataframes containing observations of males and females. You will need to create 2 new objects.
Then, use the plot() function to make a scatterplot of the proportion of sick male (in blue) and female (in red) subjects as a function of drug dosage (x-axis). You will need the points() function to add the second dataset. Do not forget to add a legend and label the axes.
Refer to the example in the introduction for a starting point! Recall that you can find information on arguments for functions by searching in the help window or typing help(functionName) into the Console!
prop.sick.mf.f <- prop.sick.mf[prop.sick.mf$sex == "female",]
prop.sick.mf.m <- prop.sick.mf[prop.sick.mf$sex == "male",]
plot(prop.sick.mf.f$treatment, prop.sick.mf.f$outcome,
lwd = 1.5,
col = "red",
xlab = "Treatment Level",
ylab = "Proportion Sick",
ylim = c(.15, .95))
points(prop.sick.mf.m$treatment, prop.sick.mf.m$outcome,
lwd = 1.5,
col = "blue")
legend("topright",
legend = c("Males", "Females"),
col = c("blue", "red"),
lwd = 1.5)
How does disease prevalence vary with drug dosage for male vs. female subjects? Would you describe the change in disease prevalence as gradual (i.e., linear) or sudden (i.e., nonlinear)? If you were a doctor, how would you decide what dosage to give? Would it be the same for males and females?
YOUR ANSWER:
Males seem to feel effects in the higher treatment levels, females fairly remain constant. I would say that disease prevalence in females in linear and doesn’t really change and the same can be said for males other than in the treatment group range of 4-6. If I were a doctor I would recommend giving dosage at 6 for men and probably 2 for women.
You are concerned that this novel drug may cause undesirable side effects. You will now study whether side effects increase as the dosage of the drug increases.
Determine whether the proportion of individuals with side effects varies 1) depending on being male or female and 2) depending on the drug dosage level.
Step 1: Create a new custom function that calculates the proportion of individuals with side effects. This function will be very similar to the function you created in Task 1! Instead of counting the number of times the word “sick” appears, you want to count the number of times that “yes” appears (as side effects are coded with a “yes” in the data).
Step 2: Use the aggregate() function to determine the proportion of male vs. female subjects that experience side-effects in the different drug dosage groups.
prop.se <- function(x) {
y <- sum(x == "yes") / length(x)
return(y)
}
prop.se.mf <- aggregate(side.effects ~ treatment + sex, FUN = prop.se, data = d2)
prop.se.mf
## treatment sex side.effects
## 1 1 female 0.200
## 2 2 female 0.450
## 3 3 female 0.300
## 4 4 female 0.350
## 5 5 female 0.500
## 6 6 female 0.400
## 7 7 female 0.600
## 8 8 female 0.700
## 9 9 female 0.725
## 10 10 female 0.925
## 11 1 male 0.300
## 12 2 male 0.275
## 13 3 male 0.325
## 14 4 male 0.475
## 15 5 male 0.350
## 16 6 male 0.550
## 17 7 male 0.700
## 18 8 male 0.750
## 19 9 male 0.775
## 20 10 male 0.875
How does the prevalence of side-effects change with dosage in male vs. female subjects? Can you tell from the resulting table?
YOUR ANSWER:
The prevalence of side-effects changes with dosage in males and females in a similar way, it increases with treatment level in both.
Use the plot() function to create a figure for the side effect data. First, plot the proportion proportion of males with side effects as a function of treatment level. Then, overlay the same data for females. Don’t forget your legend and informative axis titles!
You will first need to subset the data into a dataframe for males and a dataframe for females
prop.se.f <- prop.se.mf[prop.se.mf$sex == "female",]
prop.se.m <- prop.se.mf[prop.se.mf$sex == "male",]
plot(prop.se.f$treatment, prop.se.f$side.effects,
lwd = 1.5,
col = "red",
xlab = "Treatment Level",
ylab = "Proportion Sick",
ylim = c(.15, .95))
points(prop.se.m$treatment, prop.se.m$side.effects,
lwd = 1.5,
col = "blue")
legend("topleft",
legend = c("Males", "Females"),
col = c("blue", "red"),
lwd = 1.5)
The degree of side effects can be thought of as the “cost” of the drug, and the reduction in illness can be thought of as the “benefit” of the drug. Compare the graphs you made for drug efficacy (Task 2 Question 3) and side effects (Task 3 Question 2). As a doctor, what dosage would you recommend for males and for females, and why?
YOUR ANSWER:
If I were a doctor I would recommend giving dosage at 6 for men and probably 3 for women since the drug works best at those values in respect to limiting the side effects. Side effects prevent a strong dose for men, who benefit from a higher dose, and force females to take slightly higher since the results show that level two is an outlier proportion for side effects and taking less increases the prevalence of disease in females according to the first graph made.