In this R Markdown document, answer each question and click the Knit button to generate a PDF, HTML, or Word document, then submit it to D2L. Also, when you see “#TYPE…” in a code chunk, follow the guidance to finish the code; otherwise, your document cannot be knitted.
Type answers after the -> sign.
The lung data set contains survival information for patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities. The variables are:
inst: Institution code
time: Survival time in days
status: Censoring status (2 = censored, 1 = dead)
age: Age in years
sex: Male = 1, Female = 2
ph.ecog: ECOG performance score as rated by the physician (0 = asymptomatic, 1 = symptomatic but completely ambulatory, 2 = in bed <50% of the day, 3 = in bed >50% of the day but not bedbound, 4 = bedbound)
ph.karno: Karnofsky performance score (0 = bad, 100 = good) as rated by the physician
pat.karno: Karnofsky performance score as rated by the patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in the last six months
#Get the structure of the lung data set to view the variables in it
lung <- read.csv("https://raw.githubusercontent.com/xzhang47/3125/main/lung.csv")
str(lung) #View lung data structure
## 'data.frame': 228 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ inst : int 3 3 3 5 1 12 7 11 1 7 ...
## $ time : int 306 455 1010 210 883 1022 310 361 218 166 ...
## $ status : chr "1-dead" "1-dead" "2-censored" "1-dead" ...
## $ age : int 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ ph.ecog : int 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : int 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: int 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : int 1175 1225 NA 1150 NA 513 384 538 825 271 ...
## $ wt.loss : int NA 15 15 11 0 0 10 1 16 34 ...
-> It is a character variable.
-> It is listed as an integer. This is incorrect; it should be treated as a categorical (ordinal) variable, because the numbers are category codes rather than measurements.
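As a small illustration of treating integer codes as categories, the sketch below converts the first ten ph.ecog values shown in the str() output into an ordered factor. The short labels and the object names ecog and ecog_f are made up for this example and are not part of the assignment.

```r
# First ten ph.ecog values copied from the str() output above
ecog <- c(1, 0, 0, 1, 0, 1, 2, 2, 1, 2)

# Convert the integer codes to an ordered factor
# (the labels are illustrative abbreviations of the variable description)
ecog_f <- factor(ecog,
                 levels = 0:4,
                 labels = c("asymptomatic", "ambulatory", "in bed <50%",
                            "in bed >50%", "bedbound"),
                 ordered = TRUE)

class(ecog_f)  # shows the object is now an ordered factor
table(ecog_f)  # counts per category, including categories with zero patients
```

Once the variable is a factor, table() reports every defined category, even those that do not occur in the data.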
#Create frequency table for the ph.ecog variable in the lung data frame
table(lung$ph.ecog)
##
## 0 1 2 3
## 63 113 50 1
-> 4
#Create frequency table for the sex variable in the lung data frame
table(lung$sex)
##
## Female Male
## 90 138
-> Male
#Create frequency table with sum total
#TYPE THE VARIABLE NAME sex AFTER DOLLAR SIGN TO GET SUM TOTALS
addmargins(table(lung$sex)) #the addmargins() function adds the sum (total) of the counts
##
## Female Male Sum
## 90 138 228
-> 228
#Create relative frequency table
prop.table(table(lung$sex))
##
## Female Male
## 0.3947368 0.6052632
-> 39.47%
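To read the proportions directly as percentages, one option is to multiply by 100 and round. The sketch below rebuilds the counts from the frequency table above instead of reading the CSV file; the object names sex_counts and sex_pct are assumptions made for this example.

```r
# Counts copied from the frequency table of sex shown above
sex_counts <- c(Female = 90, Male = 138)

# Convert proportions to percentages with two decimal places
sex_pct <- round(100 * prop.table(sex_counts), 2)
sex_pct
```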
Making a graph in R requires that you first create the data table and then save it as an object. We will create and save the table as an object called sexTable. Then, in the same code chunk, we’ll use the barplot() function to make a bar chart.
sexTable <- table(lung$sex) #Save the sex variable's frequency table as an object called sexTable (this is required to make the bar chart)
barplot(sexTable, #This tells R to use the barplot function on the sexTable object
main="Basic Bar Chart of Sex", #This adds a title
xlab="Sex", #This adds an x-axis label
ylab = "Count", #This adds a y-axis label
col = "hotpink") #This gives the bars a color just for the fun of it
-> Female
Next, we make a pie chart. Conceptually, a pie chart is a bar chart wrapped into a circle. We’ll use the status variable to walk through the whole process. And, of course, it starts with creating a frequency table and saving it as an object!
#1 create a frequency table and save it as an object
#TYPE THE VARIABLE NAME status AFTER DOLLAR SIGN BELOW
event <- table(lung$status)
#2 Convert frequency table object to a data frame
event_df <- as.data.frame(event)
#3 Change variable names from Var1 & Freq to original variable's name & Count
names(event_df) <- c("Event","Count")
#4 Calculate the percentage using one decimal place and save it as an object we will call pct for percent
pct <- round(100*event_df$Count/sum(event_df$Count), 1)
#5 Add a '%' sign and category name to each slice
pie(event_df$Count, #This makes the pie reflect the counts
labels = paste0(event_df$Event, " ", pct, "%"), #This adds the category names and percentages to the pie slices
radius = 1, #Radius = 1 controls size of pie & can keep labels from overlapping if slices are really small (not an issue on our chart)
main = "Base R Pie Chart of Status") #Add title
-> 72.4%
#The variable listed first gets put on the rows; addmargins adds the sum for each row & column; We believe sex is the explanatory variable, so we'll put it on the rows - that is why it is listed first
addmargins(table(lung$sex, lung$status))
##
## 1-dead 2-censored Sum
## Female 53 37 90
## Male 112 26 138
## Sum 165 63 228
-> Male
#save contingency as object, and then use the object in the prop.table function
lung_status <- table(lung$sex, lung$status)
row_prop <- prop.table(lung_status, 1)# the 1 tells it to give row proportions; save it as an object to use when we make a bar chart
row_prop
##
## 1-dead 2-censored
## Female 0.5888889 0.4111111
## Male 0.8115942 0.1884058
-> 58.89%
-> Female
#Get column proportions
(prop.table(lung_status, 2)) # the 2 tells it to give column proportions
##
## 1-dead 2-censored
## Female 0.3212121 0.5873016
## Male 0.6787879 0.4126984
-> 67.88%
-> Female
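A quick way to keep row and column proportions straight is to check which direction adds up to 1. The sketch below rebuilds the 2x2 table from the printed counts (as a matrix, not by reading the CSV file) and checks both; the name col_prop is an assumption made for this example.

```r
# Rebuild the sex-by-status table from the counts printed above
lung_status <- matrix(c(53, 112, 37, 26), nrow = 2,
                      dimnames = list(c("Female", "Male"),
                                      c("1-dead", "2-censored")))

row_prop <- prop.table(lung_status, 1)  # each row is divided by its row total
col_prop <- prop.table(lung_status, 2)  # each column is divided by its column total

rowSums(row_prop)  # row proportions add to 1 across each row
colSums(col_prop)  # column proportions add to 1 down each column
```

Row proportions answer questions such as "of the females, what share died?", while column proportions answer "of those who died, what share were female?".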
#Make side-by-side bar chart for sex and status
We created a 2x2 table and saved it as lung_status. We will use that table to make the barplot by frequencies. Again, notice that to make the graph, we had to create the table and save it as an object first.
We put the status variable in the columns and sex on the rows in our table, so we need to label the x-axis for the columns. The row categories (sex) will show up in the legend, and the height of each bar represents the count for that sex within a status category.
Then, we’ll use the row proportion table we saved as row_prop and create a bar chart by row proportions.
#Bar chart by frequencies
barplot(lung_status, main="Death and Gender Group", #Add title
xlab="Status", #Label x-axis
ylab = "Frequencies", #Label y-axis
col=c("hotpink","lightpink"), #Color code bars; there must be the same number of colors as row categories (sex), so we need two
legend = rownames(lung_status), #Add legend
beside=TRUE) #Put bars beside each other instead of stacked (leaving out beside = TRUE results in a stacked bar chart; we can try it for fun)
#Bar chart by row proportions
barplot(row_prop, main="Death and Gender Group", #Add title
xlab="Status", #Label x-axis
ylab = "Row Proportions", #Label y-axis
col=c("hotpink","lightpink"), #Color code bars; there must be the same number of colors as row categories (sex), so we need two
legend = rownames(row_prop), #Add legend
beside=TRUE) #Put bars beside each other instead of stacked (leaving out beside = TRUE results in a stacked bar chart; we can try it for fun)
-> Female
# Get contingency table with sex on the rows and status on the columns and add the sum to each row & column
addmargins(table(lung$sex, lung$status))
##
## 1-dead 2-censored Sum
## Female 53 37 90
## Male 112 26 138
## Sum 165 63 228
-> 58.9%
-> If a lung cancer patient is female, she has a 58.9% risk of dying from the cancer.
-> 0.726
-> A lung cancer patient who is female has 72.6% of the risk of dying from the cancer when compared to a lung cancer patient who is male.
-> 1.43
-> 0.333
-> This odds ratio indicates that, in this data set, the odds of dying are about 66.7% lower for female lung cancer patients than for male lung cancer patients.
-> There does seem to be a relationship between death and gender: females appear much less likely to die than their male counterparts. The odds ratio of 0.333 shows that females have substantially lower odds of dying from lung cancer.
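The risk, relative risk, odds, and odds ratio can all be computed directly from the contingency table. The following is a minimal sketch that rebuilds the table from the printed counts; the object names (tab, risk, relative_risk, odds, odds_ratio) are assumptions made for this example. Because it uses unrounded proportions, the third decimal place can differ slightly from a hand calculation that starts from rounded values.

```r
# Rebuild the sex-by-status table from the counts printed above
tab <- matrix(c(53, 112, 37, 26), nrow = 2,
              dimnames = list(c("Female", "Male"),
                              c("1-dead", "2-censored")))

# Risk of death = deaths / row total, for each sex
risk <- tab[, "1-dead"] / rowSums(tab)

# Relative risk of death, females compared to males
relative_risk <- risk[["Female"]] / risk[["Male"]]

# Odds of death = deaths / censored, for each sex
odds <- tab[, "1-dead"] / tab[, "2-censored"]

# Odds ratio of death, females compared to males
odds_ratio <- odds[["Female"]] / odds[["Male"]]

round(c(risk_female   = risk[["Female"]],
        relative_risk = relative_risk,
        odds_female   = odds[["Female"]],
        odds_ratio    = odds_ratio), 3)
```

An odds ratio below 1 means the odds of death are lower in the first group (females) than in the comparison group (males).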
Now that you have commented on the relationship between gender and death, we will test the relationship with statistical evidence.
-> Is there an association between gender and status in lung cancer patients?
-> H0: There is no association between gender and status in lung cancer patients.
-> Ha: There is an association between gender and status in lung cancer patients.
#Save contingency table as ChiTable
ChiTable <- table(lung$sex, lung$status)
ChiTable
##
## 1-dead 2-censored
## Female 53 37
## Male 112 26
#Conduct chi-square test, save it as an object
chisq <- chisq.test(x = ChiTable)
We are not going to view it yet because, unless the conditions are met, the test result is not useful. So, check the conditions first. (R requires you to run the chi-square test before you can get the expected counts table, unless you write code to compute them yourself, which is not necessary.)
#Get both the observed and expected values
chisq$observed #Get the observed values
##
## 1-dead 2-censored
## Female 53 37
## Male 112 26
round(chisq$expected,3) #Get expected values. Round three decimal places (optional)
##
## 1-dead 2-censored
## Female 65.132 24.868
## Male 99.868 38.132
-> Yes, all conditions are met: every expected count is at least 5 (the smallest is 24.868).
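For reference, the expected counts can also be computed by hand as (row total × column total) / grand total. The sketch below rebuilds the observed table from the printed counts and is only an illustration; the object names observed and expected are assumptions made for this example.

```r
# Observed counts copied from the table above
observed <- matrix(c(53, 112, 37, 26), nrow = 2,
                   dimnames = list(c("Female", "Male"),
                                   c("1-dead", "2-censored")))

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 3)

# Check the usual condition that every expected count is at least 5
all(expected >= 5)
```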
#Tell R to show the chi-square test output we saved as chisq and compare to the values from the last two code chunks
chisq
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ChiTable
## X-squared = 12.42, df = 1, p-value = 0.0004247
-> 12.42
-> 0.0004247
-> The p-value (0.0004247) is very small, so the chi-square test result is statistically significant; therefore, we should REJECT the null hypothesis.
-> We should ACCEPT the alternative hypothesis that there is an association between gender and status in lung cancer patients.
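To see where the test statistic and p-value come from, the sketch below recomputes the Yates-corrected chi-square statistic by hand and compares it with chisq.test(). It rebuilds the table from the printed counts, and the object names (observed, expected, yates_stat, p_value, res) are assumptions made for this example.

```r
# Observed counts copied from ChiTable above
observed <- matrix(c(53, 112, 37, 26), nrow = 2,
                   dimnames = list(c("Female", "Male"),
                                   c("1-dead", "2-censored")))

# Expected counts under the null hypothesis of no association
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction subtracts 0.5 from each |observed - expected|
yates_stat <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail p-value from the chi-square distribution with 1 degree of freedom
p_value <- pchisq(yates_stat, df = 1, lower.tail = FALSE)

# Compare with R's built-in test (Yates' correction is the default for 2x2 tables)
res <- chisq.test(observed)
c(by_hand = yates_stat, built_in = unname(res$statistic))
c(by_hand = p_value, built_in = res$p.value)
```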