In this R Markdown document, answer each question and click the Knit button to generate a PDF, HTML, or Word document, which you will submit to D2L. When you see “#TYPE…” in a code chunk, follow the guidance to finish the code; otherwise, your document cannot be knitted.

Type answers after the -> sign.

Qualitative (Categorical) Data Analysis for R’s lung data

The lung data set contains information on survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities. The variables are:

inst: Institution code
time: Survival time in days
status: Censoring status, 2=censored, 1=dead
age: Age in years
sex: Male=1, Female=2
ph.ecog: ECOG performance score as rated by the physician. 0=asymptomatic, 1=symptomatic but completely ambulatory, 2=in bed <50% of the day, 3=in bed >50% of the day but not bedbound, 4=bedbound
ph.karno: Karnofsky performance score (bad=0, good=100) as rated by the physician
pat.karno: Karnofsky performance score as rated by the patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in the last six months

#Get the structure of the lung data set to view the variables in it
lung <- read.csv("https://raw.githubusercontent.com/xzhang47/3125/main/lung.csv")
str(lung) #View lung data structure
## 'data.frame':    228 obs. of  11 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ inst     : int  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : int  306 455 1010 210 883 1022 310 361 218 166 ...
##  $ status   : chr  "1-dead" "1-dead" "2-censored" "1-dead" ...
##  $ age      : int  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : chr  "Male" "Male" "Male" "Male" ...
##  $ ph.ecog  : int  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : int  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: int  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : int  1175 1225 NA 1150 NA 513 384 538 825 271 ...
##  $ wt.loss  : int  NA 15 15 11 0 0 10 1 16 34 ...
  1. What type of variable is status?

-> It is a categorical (qualitative) variable; R lists it as a character (chr) variable.

  1. What type of variable did R list ph.ecog as, and is it accurate?

-> It is listed as an integer. This is not accurate: ph.ecog is an ordinal categorical variable whose categories are coded with numbers.

#Create frequency table for the ph.ecog variable in the lung data frame
table(lung$ph.ecog)
## 
##   0   1   2   3 
##  63 113  50   1
  1. How many categories does the ph.ecog variable have?

-> 4
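The frequency table above only lists categories that were actually observed. As a minimal sketch, ph.ecog could be stored as an ordered factor with all five ECOG levels from the variable description, which makes an unobserved level visible. The vector below is rebuilt from the printed counts rather than read from the CSV, and the single missing value is an assumption inferred from the table total (227) versus the 228 patients in the data.

```r
# Rebuild ph.ecog from the printed counts; the single NA is an assumption
# inferred from the table total (227) versus the 228 patients in the data.
ph_ecog <- c(rep(0:3, times = c(63, 113, 50, 1)), NA)

# Store it as an ordered factor with all five ECOG levels (0 to 4)
ecog_f <- factor(ph_ecog, levels = 0:4, ordered = TRUE)

table(ecog_f)                   # level 4 is listed with a count of 0
table(ecog_f, useNA = "ifany")  # also shows the missing value
```

Four categories are observed in the data, while the scale itself defines five.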

#Create frequency table for the sex variable in the lung data frame
table(lung$sex)
## 
## Female   Male 
##     90    138
  1. Which gender group has more patients?

-> Male

#Create frequency table with sum total
#TYPE THE VARIABLE NAME sex AFTER DOLLAR SIGN TO GET SUM TOTALS
addmargins(table(lung$sex)) #The addmargins() function adds the sum total to the table
## 
## Female   Male    Sum 
##     90    138    228
  1. How many lung cancer patients were included in the study?

-> 228

#Create relative frequency table
prop.table(table(lung$sex))
## 
##    Female      Male 
## 0.3947368 0.6052632
  1. What proportion (or can convert to percentage) are female lung cancer patients?

-> 39.47%
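Converting the proportions to percentages can also be done in R instead of by hand. This is a small sketch that uses the counts from the frequency table above; the object names sex_counts and sex_pct are made up for illustration.

```r
# Counts copied from the frequency table above
sex_counts <- c(Female = 90, Male = 138)

# Proportions times 100, rounded to two decimal places
sex_pct <- round(100 * prop.table(sex_counts), 2)
sex_pct
```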

Make a barchart (called barplot in R)

Making a graph in R requires you first create the data table and then save it as an object. We will create and save the table as an object called sexTable. Then, in the same code chunk, we’ll use the barplot() function to make a bar chart.

sexTable <- table(lung$sex) #Save the sex variable's frequency table as an object called sexTable (this is required to make the bar chart)

barplot(sexTable, #This tells R to use the barplot function on the sexTable object
        main = "Basic Bar Chart of Sex", #This adds a title
        xlab = "Sex", #This adds an x-axis label
        ylab = "Count", #This adds a y-axis label
        col = "hotpink") #This gives the bars a color just for the fun of it

  1. Based on the bar chart, which gender had the fewer number of patients in the study?

-> Female

Create a pie chart

Technically, how this works is we create a bar chart and convert it to a circle. We’ll use the status variable to walk through the whole process. And, of course, it will start with creating a frequency table and saving it as an object!

#1 create a frequency table and save it as an object
#TYPE THE VARIABLE NAME status AFTER DOLLAR SIGN BELOW
event <- table(lung$status) 
#2 Convert frequency table object to a data frame
event_df  <- as.data.frame(event) 
#3 Change variable names from Var1 & Freq to original variable's name & Count
names(event_df) <- c("Event","Count") 
#4 Calculate the percentage using one decimal place and save it as an object we will call pct for percent
pct <- round(100*event_df$Count/sum(event_df$Count), 1) 
#5 Add a '%' sign and category name to each slice
pie(event_df$Count, #This makes the pie reflect the counts
    labels = paste(event_df$Event, sep = " ", pct, "%"), #This adds the category names and percentages to the pie slices
    radius = 1, #Radius = 1 controls size of pie & can keep labels from overlapping if slices are really small (not an issue on our chart)
    main = "Base R Pie Chart of Status") #Add title

  1. What percentage of the 228 people on the study are dead?

-> 72.4%

Create 2x2 (contingency) table with row & column totals

#The variable listed first gets put on the rows; addmargins adds the sum for each row & column; We believe sex is the explanatory variable, so we'll put it on the rows - that is why it is listed first
addmargins(table(lung$sex, lung$status))
##         
##          1-dead 2-censored Sum
##   Female     53         37  90
##   Male      112         26 138
##   Sum       165         63 228
  1. Which gender has the higher death counts?

-> Male

Create a relative frequency table using the prop.table function (proportion table) by row percentages

Save the contingency table as an object and use the object in the proportion function

#save contingency as object, and then use the object in the prop.table function
lung_status <- table(lung$sex, lung$status) 
row_prop <- prop.table(lung_status, 1)# the 1 tells it to give row proportions; save it as an object to use when we make a bar chart
row_prop
##         
##             1-dead 2-censored
##   Female 0.5888889  0.4111111
##   Male   0.8115942  0.1884058
  1. Of all the females, what proportion died?

-> 58.89%

  1. Which gender group had the smaller proportion of deaths?

-> Female

#Get column proportions
(prop.table(lung_status, 2)) # the 2 tells it to give column proportions
##         
##             1-dead 2-censored
##   Female 0.3212121  0.5873016
##   Male   0.6787879  0.4126984
  1. Out of all who died, what proportion are males?

-> 67.88%

  1. The smaller proportion of deaths came from which gender group?

-> Female
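Row and column proportions answer different questions, so it can help to check which margin sums to 1. The sketch below rebuilds the contingency table from the counts printed above instead of reading the CSV; col_prop is a name introduced here for illustration.

```r
# Contingency table rebuilt from the printed counts (sex on rows, status on columns)
lung_status <- as.table(matrix(
  c(53, 112, 37, 26), nrow = 2,
  dimnames = list(c("Female", "Male"), c("1-dead", "2-censored"))
))

row_prop <- prop.table(lung_status, 1)  # each row sums to 1: "of all females, ..."
col_prop <- prop.table(lung_status, 2)  # each column sums to 1: "of all who died, ..."

rowSums(row_prop)
colSums(col_prop)
```

Questions that start with "of all females" or "of all males" use row_prop; questions that start with "of all who died" use col_prop.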

Make side-by-side bar chart for sex and status

We created a 2x2 table and saved it as lung_status. We will use that table to make the barplot by frequencies. Again, notice that to make the graph, we had to create the table and save it as an object first.

We put the status variable in the columns and sex on the rows in our table, so we label the x-axis with the column variable (status). The row categories (sex) will show up in the legend, and each bar's height is the value for that sex within a status group.

Then, we’ll use the row proportion table we saved as row_prop and create a bar chart by row proportions.

#Bar chart by frequencies
barplot(lung_status, main = "Death and Gender Group", #Add title
        xlab = "Status", #Label x-axis
        ylab = "Frequencies", #Label y-axis
        col = c("hotpink", "lightpink"), #Color code bars; there must be the same number of colors as row categories (sex), so we need two
        legend = rownames(lung_status), #Add legend
        beside = TRUE) #Put bars beside each other instead of stacked (excluding beside = TRUE results in a stacked bar chart); we can try it for fun

#Bar chart by row proportions
barplot(row_prop, main = "Death and Gender Group", #Add title
        xlab = "Status", #Label x-axis
        ylab = "Row Proportions", #Label y-axis (these bars show proportions, not counts)
        col = c("hotpink", "lightpink"), #Color code bars; there must be the same number of colors as row categories (sex), so we need two
        legend = rownames(row_prop), #Add legend
        beside = TRUE) #Put bars beside each other instead of stacked (excluding beside = TRUE results in a stacked bar chart); we can try it for fun

  1. Based on the frequencies bar chart, which gender group has fewer deaths?

-> Female

# Get contingency table with sex on the rows and status on the columns and add the sum to each row & column
addmargins(table(lung$sex, lung$status))
##         
##          1-dead 2-censored Sum
##   Female     53         37  90
##   Male      112         26 138
##   Sum       165         63 228
  1. What is the risk of death if a patient is a female?

-> 58.9%

  1. Interpret the risk above:

-> A female lung cancer patient in this study had a 58.9% risk (probability) of dying.

  1. What is the relative risk (i.e. risk ratio) of death if the patient is female as compared to males?

-> 0.726

  1. Interpret the relative risk above:

-> A female lung cancer patient has 0.726 times (72.6% of) the risk of dying that a male lung cancer patient has.

  1. What are the odds of death if a patient is female?

-> 1.43

  1. What is the odds ratio of death if the patient is female as compared to males?

-> 0.333

  1. Interpret the odds ratio above:

-> The odds of death for female lung cancer patients in this data set are 0.333 times the odds for male patients, that is, about 66.7% lower.

  1. Explain why there does or does not appear to be a relationship between death and gender based on the above results you have seen.

-> There does appear to be a relationship between death and gender: females were less likely to die than males (58.9% versus 81.2% by row proportions), and the odds ratio of 0.333 shows that their odds of death were about one third of the males' odds.
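The risk, relative risk, odds, and odds ratio asked about above can all be computed from the contingency table. The following is a sketch that rebuilds the table from the printed counts; the object names (tab, risk, rr, odds, odds_ratio) are chosen here for illustration and do not come from the assignment.

```r
# Contingency table rebuilt from the printed counts (sex on rows, status on columns)
tab <- matrix(
  c(53, 112, 37, 26), nrow = 2,
  dimnames = list(c("Female", "Male"), c("1-dead", "2-censored"))
)

# Risk of death = deaths / row total, for each sex
risk <- tab[, "1-dead"] / rowSums(tab)

# Relative risk: female risk compared to male risk
rr <- risk[["Female"]] / risk[["Male"]]

# Odds of death = deaths / censored, for each sex
odds <- tab[, "1-dead"] / tab[, "2-censored"]

# Odds ratio: female odds compared to male odds
odds_ratio <- odds[["Female"]] / odds[["Male"]]

round(c(risk_female = risk[["Female"]], relative_risk = rr,
        odds_female = odds[["Female"]], odds_ratio = odds_ratio), 3)
```

Dividing the unrounded values, rather than ratios of already-rounded numbers, avoids small rounding differences in the third decimal place.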

Conduct test of independence with Chi-square test.

Since you have commented about the relationship between gender and death, now, we will test the relationship with statistical evidence.

  1. What is our research question in terms of test of independence? What is the name of the test?

-> Is there an association between gender and status in lung cancer patients? The test is the chi-square test of independence.

  1. What is the null hypothesis?

-> H0: Gender and status are independent (there is no association between them) in lung cancer patients.

  1. What is the alternative hypothesis?

-> Ha: Gender and status are not independent (there is an association between them) in lung cancer patients.

#Save contingency table as ChiTable
ChiTable <- table(lung$sex, lung$status)
ChiTable
##         
##          1-dead 2-censored
##   Female     53         37
##   Male      112         26
#Conduct chi-square test, save it as an object
chisq <- chisq.test(x = ChiTable)

We are not going to view it yet, because the output is not useful unless the conditions are met. So, check the conditions first. (R requires you to run the chi-square test before you can pull the expected counts table from it, unless you write code to compute the expected counts yourself, which is not necessary.)

#Get both the observed and expected values
chisq$observed #Get the observed values
##         
##          1-dead 2-censored
##   Female     53         37
##   Male      112         26
round(chisq$expected,3) #Get expected values. Round three decimal places (optional)
##         
##          1-dead 2-censored
##   Female 65.132     24.868
##   Male   99.868     38.132
  1. Based on the expected values, are all the conditions met?

-> Yes, all conditions are met: every expected count is at least 5 (the smallest is 24.868).
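For reference, the expected counts can also be computed directly as (row total × column total) / grand total. This sketch rebuilds ChiTable from the printed counts and assumes the common rule of thumb that every expected count should be at least 5; the name expected is introduced here for illustration.

```r
# Contingency table rebuilt from the printed counts
ChiTable <- matrix(
  c(53, 112, 37, 26), nrow = 2,
  dimnames = list(c("Female", "Male"), c("1-dead", "2-censored"))
)

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(ChiTable), colSums(ChiTable)) / sum(ChiTable)
round(expected, 3)

# Rule-of-thumb check (assumed threshold of 5)
all(expected >= 5)
```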

#Tell R to show the chi-square test output we saved as chisq and compare to the values from the last two code chunks
chisq
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ChiTable
## X-squared = 12.42, df = 1, p-value = 0.0004247
  1. What is the chi-square test statistic?

-> 12.42

  1. What is the p-value (take it out of scientific notation)?

-> 0.0004247
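To see where the test statistic and p-value come from, they can be reproduced by hand. This is a sketch, not part of the assignment: it applies Yates' continuity correction (subtracting 0.5 from each absolute difference, as the output header indicates for this 2x2 table) and uses pchisq() for the upper-tail probability. The object names are made up for illustration.

```r
# Observed counts rebuilt from the printed contingency table
observed <- matrix(
  c(53, 112, 37, 26), nrow = 2,
  dimnames = list(c("Female", "Male"), c("1-dead", "2-censored"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with Yates' continuity correction
stat <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value, printed without scientific notation
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
round(stat, 2)
format(p_value, scientific = FALSE)
```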

  1. What is the conclusion we should make about the null hypothesis: Should we reject it or fail to reject it?

-> The p-value (0.0004247) is far below the usual 0.05 significance level, so we should REJECT the null hypothesis.

  1. What is the conclusion we should make about the alternative hypothesis?

-> There is sufficient evidence to support the alternative hypothesis that there is an association between gender and status in lung cancer patients.

Knit the file as either an HTML or a PDF. Submitting as a PDF may require installing the tinytex package. If you can, great. If not, then just knit as HTML. Submit your file in the D2L homework assignment 2 submission folder.