#install.packages("ggplot2")
library(ggplot2)
#install.packages("readr")
library(readr)
bca <- read_csv("BrainCancerData1.csv")
## Rows: 48 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): year, rate_cases1, rate_cases2, death_rate, pct_survival
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Part A Question 1: Determine the data types of the columns \(rate\_cases2\), \(death\_rate\), and \(pct\_survival\)
# Write in the following chunk.
# Hint: follow this syntax str(dataframe) to obtain the structure of the dataset!
str(bca)
## spc_tbl_ [48 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:48] 1975 1976 1977 1978 1979 ...
## $ rate_cases1 : num [1:48] NA NA NA NA NA NA NA NA NA NA ...
## $ rate_cases2 : num [1:48] 5.93 5.86 6.29 5.8 6.12 6.57 6.53 6.39 6.37 5.96 ...
## $ death_rate : num [1:48] 4.11 4.34 4.4 4.53 4.26 4.37 4.36 4.43 4.39 4.55 ...
## $ pct_survival: num [1:48] 23.5 22.6 22.3 25.2 23.1 ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. rate_cases1 = col_double(),
## .. rate_cases2 = col_double(),
## .. death_rate = col_double(),
## .. pct_survival = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# all three columns are numeric
All three columns are numeric: str() reports each as num, and readr parsed them as doubles (col_double()).
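As a complement to str(), the column classes can also be checked programmatically. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `bca`, built from the first five rows printed by str() above (the pct_survival values are the rounded ones shown there).

```r
# Toy data frame built from the first five rows printed by str(bca);
# `toy` is a stand-in for illustration, not the real dataset.
toy <- data.frame(
  year         = c(1975, 1976, 1977, 1978, 1979),
  rate_cases2  = c(5.93, 5.86, 6.29, 5.80, 6.12),
  death_rate   = c(4.11, 4.34, 4.40, 4.53, 4.26),
  pct_survival = c(23.5, 22.6, 22.3, 25.2, 23.1)
)

# One class per column, returned as a named character vector
col_classes <- sapply(toy, class)
col_classes

# TRUE only if every column is numeric
all_numeric <- all(sapply(toy, is.numeric))
all_numeric
```

On the real data the same calls would be sapply(bca, class) and sapply(bca, is.numeric).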
Question 2: Index the column \(rate\_cases1\) from the dataset
# Write in the following chunk.
# Hint: follow the syntax dataframe$column name to index a specific column
bca$year
## [1] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## [16] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## [31] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## [46] 2020 2021 2022
# dataframe[row, column]
bca[1, ]   # first row, all columns
bca[, 1]   # first column, all rows
bca[1, 1]  # first row of the first column
bca$rate_cases1
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [16] NA NA 6.73 6.50 6.32 6.30 6.38 6.48 6.47 6.67 6.38 6.29 6.44 6.49 6.54
## [31] 6.54 6.26 6.43 6.40 6.63 6.31 6.15 6.25 6.31 6.13 6.40 6.07 6.18 6.17 5.87
## [46] 5.90 5.67 NA
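The different indexing forms return different kinds of objects, which the short sketch below tries to make visible. `toy` is a hypothetical three-row data frame using the 1992 to 1994 values of rate_cases1 printed above; it is a base data.frame, whereas `bca` is a tibble, and tibbles keep `[` results as tibbles (so `bca[1, "rate_cases1"]` would be a 1 x 1 tibble rather than a single number).

```r
# Hypothetical three-row stand-in for bca (years 1992-1994)
toy <- data.frame(
  year        = c(1992, 1993, 1994),
  rate_cases1 = c(6.73, 6.50, 6.32)
)

by_dollar  <- toy$rate_cases1         # numeric vector
by_name    <- toy[["rate_cases1"]]    # same vector, column name as a string
by_bracket <- toy["rate_cases1"]      # one-column data frame, not a vector
first_row  <- toy[1, ]                # row 1, all columns
one_cell   <- toy[1, "rate_cases1"]   # single value (for a base data.frame)

identical(by_dollar, by_name)
class(by_bracket)
```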
Question 1: Find the mean rate of new brain cancer cases from \(rate\_cases1\)
# Below is an example function you can apply for mean calculation
avg_rate_cases<-mean(bca$rate_cases1, na.rm = TRUE)
avg_rate_cases
## [1] 6.322
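The mean value of “rate_cases1” was calculated to be 6.322. This tells us that, on average, there were approximately 6.322 new cases of brain cancer each year per a fixed number of individuals (the dataset does not state the denominator; incidence rates are commonly reported per 100,000 people). Because na.rm = TRUE drops the 18 years with missing values, this average describes only the 30 years with data (1992 to 2021). It gives a general sense of the incidence rate across the population.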
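To see why na.rm = TRUE is needed here, the sketch below rebuilds the rate_cases1 column from the values printed by bca$rate_cases1 above (17 leading NAs, 30 observed values, 1 trailing NA) and compares the two calls. The vector is typed in by hand from that output, so treat it as an illustration rather than a replacement for reading the CSV.

```r
# rate_cases1 as printed above: NA for 1975-1991, values for 1992-2021, NA for 2022
observed <- c(6.73, 6.50, 6.32, 6.30, 6.38, 6.48, 6.47, 6.67, 6.38, 6.29,
              6.44, 6.49, 6.54, 6.54, 6.26, 6.43, 6.40, 6.63, 6.31, 6.15,
              6.25, 6.31, 6.13, 6.40, 6.07, 6.18, 6.17, 5.87, 5.90, 5.67)
rate_cases1 <- c(rep(NA, 17), observed, NA)

mean_all <- mean(rate_cases1)                # NA: any missing value propagates
mean_obs <- mean(rate_cases1, na.rm = TRUE)  # mean of the non-missing values only
n_used   <- sum(!is.na(rate_cases1))         # how many years enter the mean

mean_all
mean_obs
n_used
```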
Now it is your turn to calculate the mean for \(rate\_cases2\), \(death\_rate\), and \(pct\_survival\)
# Write in the following chunk:
avg_rate_cases2<-mean(bca$rate_cases2, na.rm = TRUE)
avg_rate_cases2
## [1] 6.561489
avg_death_rate<-mean(bca$death_rate, na.rm = TRUE)
avg_death_rate
## [1] 4.483958
avg_pct_survival<-mean(bca$pct_survival, na.rm = TRUE)
avg_pct_survival
## [1] 31.21929
Question 2: Find the standard deviation of new brain cancer cases from \(rate\_cases1\)
# Below is an example function you can apply for standard deviation calculation
sd_rate_cases<-sd(bca$rate_cases1, na.rm = TRUE)
sd_rate_cases
## [1] 0.236372
The standard deviation of “rate_cases1” was calculated to be 0.236. This tells us that the yearly rate of new brain cancer cases typically differs from the mean by about 0.236, in the same units as the rate itself. Relative to a mean of 6.322 this variability is small, indicating that the incidence of brain cancer remained fairly consistent from year to year.
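For reference, sd() computes the sample standard deviation, which divides by n - 1 rather than n. The sketch below works through the formula by hand on the 30 observed rate_cases1 values (copied from the printed output above) and compares it with sd().

```r
# The 30 non-missing rate_cases1 values (1992-2021), copied from the output above
x <- c(6.73, 6.50, 6.32, 6.30, 6.38, 6.48, 6.47, 6.67, 6.38, 6.29,
       6.44, 6.49, 6.54, 6.54, 6.26, 6.43, 6.40, 6.63, 6.31, 6.15,
       6.25, 6.31, 6.13, 6.40, 6.07, 6.18, 6.17, 5.87, 5.90, 5.67)

n <- length(x)

# Sample standard deviation: square root of the summed squared deviations over n - 1
manual_sd  <- sqrt(sum((x - mean(x))^2) / (n - 1))
builtin_sd <- sd(x)

manual_sd
builtin_sd
all.equal(manual_sd, builtin_sd)
```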
Now calculate the standard deviation for \(rate\_cases2\), \(death\_rate\), and \(pct\_survival\)
# Write in the following chunk:
sd_rate_cases2<-sd(bca$rate_cases2, na.rm = TRUE)
sd_rate_cases2
## [1] 0.3297692
sd_death_rate<-sd(bca$death_rate, na.rm = TRUE)
sd_death_rate
## [1] 0.1983092
sd_pct_survival<-sd(bca$pct_survival, na.rm = TRUE)
sd_pct_survival
## [1] 4.832159
The mean death rate was calculated to be 4.484, with a standard deviation of 0.198. This low standard deviation tells us that brain cancer death rates were also highly consistent over time. This information can be important for public health analysis as it helps with long-term planning and evaluation of treatment outcomes. The consistency in death rates may also point to a lack of significant improvement in reducing mortality, indicating the need for new innovations.
The mean of “pct_survival” was calculated as 31.219, while its standard deviation is 4.832 percentage points. These values indicate that, on average, about 31.2% of individuals diagnosed with brain cancer survived, and that this percentage varied more from year to year than the incidence or death rates did. Because survival is measured in different units from the rates, the fairer comparison is relative to the mean: the standard deviation is about 15% of the mean for survival, versus roughly 4 to 5% for the incidence and death rates. The higher variability suggests larger changes in survival over the period, possibly due to new treatments, early detection efforts, or other healthcare factors. Compared to the new case and death rates, survival rates were less consistent over time, which could be important for analyzing improvements or setbacks in patient outcomes.
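One way to compare variability across columns with different units is the coefficient of variation (standard deviation divided by mean). The sketch below uses the means and standard deviations reported in the output above; the vector names `means`, `sds`, and `cv` are chosen here for illustration.

```r
# Summary statistics copied from the console output above
means <- c(rate_cases1 = 6.322,    rate_cases2 = 6.561489,
           death_rate  = 4.483958, pct_survival = 31.21929)
sds   <- c(rate_cases1 = 0.236372,  rate_cases2 = 0.3297692,
           death_rate  = 0.1983092, pct_survival = 4.832159)

# Coefficient of variation: unit-free measure of spread relative to the mean
cv <- sds / means

# Express as percentages, rounded for readability
round(100 * cv, 1)

# Which column varies most relative to its own mean?
names(which.max(cv))
```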
Question 3: Find the range of new brain cancer cases from \(rate\_cases1\)
# Below is an example function you can apply for range calculation
range_rate_cases<-range(bca$rate_cases1, na.rm = TRUE)
range_rate_cases
## [1] 5.67 6.73
Given that the range of “rate_cases1” is from 5.67 to 6.73, this suggests that the incidence rates over the years fell within a relatively narrow band, with the lowest rate at 5.67 new cases and the highest rate at 6.73 new cases. Combined with the low standard deviation, this range indicates that brain cancer rates were stable and did not show extreme highs or lows during the measured period. Most yearly rates remained close to the mean, reinforcing the idea of low year-to-year variability.
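range() returns the minimum and maximum; the width of the band and the years in which the extremes occurred can be read off with a few more calls. The sketch below again uses the 30 observed values printed earlier, and it assumes they correspond to 1992 through 2021, which follows from their positions in the printed year vector.

```r
# Observed rate_cases1 values and their years (positions 18-47 of the printed vectors)
years <- 1992:2021
x <- c(6.73, 6.50, 6.32, 6.30, 6.38, 6.48, 6.47, 6.67, 6.38, 6.29,
       6.44, 6.49, 6.54, 6.54, 6.26, 6.43, 6.40, 6.63, 6.31, 6.15,
       6.25, 6.31, 6.13, 6.40, 6.07, 6.18, 6.17, 5.87, 5.90, 5.67)

r     <- range(x)   # c(min, max)
width <- diff(r)    # max - min

peak_year <- years[which.max(x)]  # year with the highest rate
low_year  <- years[which.min(x)]  # year with the lowest rate

r
width
c(peak = peak_year, low = low_year)
```

In this reconstruction the maximum falls in the first observed year and the minimum in the last one, which is consistent with a gradual decline rather than random scatter around the mean.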
Now calculate the range for \(rate\_cases2\), \(death\_rate\), and \(pct\_survival\)
# Write in the following chunk:
range_rate_cases2<-range(bca$rate_cases2, na.rm = TRUE)
range_rate_cases2
## [1] 5.80 7.17
range_death_rate<-range(bca$death_rate, na.rm = TRUE)
range_death_rate
## [1] 4.11 4.95
range_pct_survival<-range(bca$pct_survival, na.rm = TRUE)
range_pct_survival
## [1] 22.33 37.93
Question 1: a) Perform a linear regression where \(year\) is the independent variable (predictor) and \(rate\_cases1\) is the dependent variable (outcome).
# Write in the following chunk:
# Hint: follow the syntax lm(y~x, data = dataframe)
linear_reg1 <- lm(rate_cases1 ~ year, data = bca)
summary(linear_reg1)
##
## Call:
## lm(formula = rate_cases1 ~ year, data = bca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36981 -0.08465 0.00650 0.11267 0.35665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.371749 7.014320 6.468 5.24e-07 ***
## year -0.019462 0.003496 -5.567 5.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1657 on 28 degrees of freedom
## (18 observations deleted due to missingness)
## Multiple R-squared: 0.5254, Adjusted R-squared: 0.5084
## F-statistic: 30.99 on 1 and 28 DF, p-value: 5.901e-06
Write down the linear regression equation:
\[ rate\_cases1 = 45.372 - 0.0195(year) \]
b) Find the coefficients and p-values of the intercept and slope, and interpret each.
The coefficient of the intercept is 45.372. This is the predicted value of \(rate\_cases1\) when \(year\) is 0, which lies far outside the observed years (1992 to 2021 for this column), so it anchors the line rather than describing a real rate. The p-value of the intercept is 5.24e-07, which is smaller than 0.05.
The coefficient of the slope is -0.0195. This means that for every one-year increase in \(year\), the predicted \(rate\_cases1\) decreases by 0.0195. The p-value of the slope is 5.90e-06, which is also smaller than 0.05. In conclusion, we can reject the null hypothesis that the coefficient is 0 for both the intercept and the slope, indicating that there is a statistically significant negative relationship between \(rate\_cases1\) and \(year\).
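Because year 0 is so far from the data, a common way to get a more interpretable intercept is to shift the predictor. The sketch below refits the model on the 30 observed values (typed in from the output above, assuming they correspond to 1992 through 2021) with year measured from 1992. The slope is unchanged, and the intercept becomes the predicted rate in 1992. The names `d`, `fit_raw`, and `fit_centered` are illustrative.

```r
# Observed rate_cases1 values with their years, reconstructed from the printed output
d <- data.frame(
  year = 1992:2021,
  rate_cases1 = c(6.73, 6.50, 6.32, 6.30, 6.38, 6.48, 6.47, 6.67, 6.38, 6.29,
                  6.44, 6.49, 6.54, 6.54, 6.26, 6.43, 6.40, 6.63, 6.31, 6.15,
                  6.25, 6.31, 6.13, 6.40, 6.07, 6.18, 6.17, 5.87, 5.90, 5.67)
)

# Original parameterisation: intercept is the prediction at year 0
fit_raw <- lm(rate_cases1 ~ year, data = d)

# Shifted predictor: intercept is the prediction at year 1992
fit_centered <- lm(rate_cases1 ~ I(year - 1992), data = d)

coef(fit_raw)
coef(fit_centered)
```

Both fits describe the same line; only the reference point for the intercept differs.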
linear_reg2 <- lm(pct_survival~year, data = bca)
summary(linear_reg2)
##
## Call:
## lm(formula = pct_survival ~ year, data = bca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.269 -1.617 -0.091 1.784 3.142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -682.25645 52.14369 -13.08 4.89e-16 ***
## year 0.35754 0.02613 13.68 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.053 on 40 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.824, Adjusted R-squared: 0.8196
## F-statistic: 187.2 on 1 and 40 DF, p-value: < 2.2e-16
Linear regression equation: \[ pct\_survival = -682.256 + 0.358(year) \]
The coefficient of the intercept is -682.256. This is the predicted value of \(pct\_survival\) when \(year\) is 0; a negative survival percentage is impossible, which shows that the intercept is an extrapolation far outside the observed years and not a meaningful prediction. The p-value of the intercept is 4.89e-16, which is smaller than 0.05.
The coefficient of the slope is 0.358. This means that for every one-year increase in \(year\), the predicted \(pct\_survival\) increases by 0.358 percentage points. The p-value of the slope is < 2e-16, which is also smaller than 0.05. In conclusion, we can reject the null hypothesis that the coefficient is 0 for both the intercept and the slope, indicating that there is a statistically significant positive relationship between \(pct\_survival\) and \(year\).
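The fitted equation can be used to compute predicted survival for particular years. The sketch below plugs the coefficients reported by summary(linear_reg2) into a small helper; `predict_survival` is a hypothetical function written for this illustration, and because the printed coefficients are rounded, its results are approximate.

```r
# Coefficients copied from summary(linear_reg2) above (rounded as printed)
b0 <- -682.25645  # intercept
b1 <- 0.35754     # slope: percentage points per year

# Hypothetical helper: predicted pct_survival for a given year
predict_survival <- function(year) {
  b0 + b1 * year
}

# Predictions for a few years inside the study period
predict_survival(c(1980, 2000, 2020))

# Implied change over a decade
decade_change <- predict_survival(2010) - predict_survival(2000)
decade_change
```

With the fitted model object available, predict(linear_reg2, newdata = data.frame(year = 2000)) is the usual route and avoids the rounding.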
linear_reg3 <- lm(death_rate ~ year, data = bca)
summary(linear_reg3)
##
## Call:
## lm(formula = death_rate ~ year, data = bca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45346 -0.14863 -0.01998 0.16185 0.44067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.245058 4.053187 2.774 0.00797 **
## year -0.003383 0.002028 -1.668 0.10208
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1947 on 46 degrees of freedom
## Multiple R-squared: 0.05704, Adjusted R-squared: 0.03654
## F-statistic: 2.783 on 1 and 46 DF, p-value: 0.1021
Linear regression equation: \[ death\_rate = 11.245 - 0.003(year) \]
The coefficient of the intercept is 11.245. This refers to the predicted value of \(death\_rate\) when \(year\) is 0. The p-value of the intercept is 0.00797, which is smaller than 0.05.
The coefficient of the slope is -0.003. This means, for every one-unit increase in \(year\), \(death\_rate\) decreases by 0.003 units. The p-value of the slope is 0.10208, which is larger than 0.05.
In conclusion, we can reject the null hypothesis that the intercept is 0, but we can’t reject the null hypothesis that the slope is 0, indicating that there isn’t a statistically significant relationship between death rate and year.
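The same conclusion can be read from a 95% confidence interval for the slope: if the interval contains 0, the slope is not significant at the 0.05 level. The sketch below reconstructs the t statistic, p-value, and interval from the estimate, standard error, and degrees of freedom printed by summary(linear_reg3); the inputs are the rounded printed values, so the results are approximate.

```r
# Values copied from summary(linear_reg3) above
est <- -0.003383  # slope estimate
se  <-  0.002028  # standard error of the slope
df  <-  46        # residual degrees of freedom

# Two-sided t test of H0: slope = 0
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval: estimate +/- critical t value * standard error
ci <- est + c(-1, 1) * qt(0.975, df) * se

t_stat
p_value
ci
```

With the model object available, confint(linear_reg3) returns these intervals directly.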
Question 2: Plot the regression of \(year\) vs \(rate\_cases1\)
ggplot(bca, aes(x = year, y = rate_cases1)) +
  geom_point(color = "red", size = 4) +
  geom_smooth(method = "lm") +
  labs(x = "Year", y = "Rate of New Cases", title = "Year vs New Case Rate") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 18 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(bca, aes(x = year, y = rate_cases1)) +
  geom_point() +                # scatter plot
  geom_smooth(method = "lm") +  # add linear regression line
  labs(x = "Year", y = "Rate of New Cases", title = "Year vs. New Case Rate")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 18 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).
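The "Removed 18 rows" warnings come from the 18 missing values in rate_cases1, the same rows that lm() dropped. The sketch below rebuilds the column from the printed output (so `bca_toy` is a hypothetical stand-in for `bca`), counts the missing rows, and filters them out before plotting. Passing formula = y ~ x to geom_smooth() also silences the formula message. The plotting part only runs if ggplot2 is installed.

```r
# Stand-in for bca with only the two plotted columns, reconstructed from the output above
observed <- c(6.73, 6.50, 6.32, 6.30, 6.38, 6.48, 6.47, 6.67, 6.38, 6.29,
              6.44, 6.49, 6.54, 6.54, 6.26, 6.43, 6.40, 6.63, 6.31, 6.15,
              6.25, 6.31, 6.13, 6.40, 6.07, 6.18, 6.17, 5.87, 5.90, 5.67)
bca_toy <- data.frame(
  year        = 1975:2022,
  rate_cases1 = c(rep(NA, 17), observed, NA)
)

# Number of rows ggplot2 would drop
n_missing <- sum(is.na(bca_toy$rate_cases1))
n_missing

# Keep only the rows that can actually be plotted
bca_complete <- bca_toy[!is.na(bca_toy$rate_cases1), ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(bca_complete, aes(x = year, y = rate_cases1)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(x = "Year", y = "Rate of New Cases", title = "Year vs. New Case Rate")
  # print(p) draws the plot
}
```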
Now plot the regression of \(year\) vs \(pct\_survival\)
ggplot(data = bca,
aes(x = year, y = pct_survival, color = pct_survival)) +
scale_color_viridis_c()+
geom_point(size = 2)+
geom_smooth(method = "lm", color = "green")+
labs(x = "Year", y = "PCT Survival Rate", title ="Year vs PCT survival rate")+
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
Now plot the regression of \(year\) vs \(death\_rate\)
ggplot(data = bca,
aes(x = year, y = death_rate))+
geom_point(size = 2)+
geom_smooth(method = "lm", color = "pink")+
labs(x = "Year", y = "Death Rate", title ="Year vs Death Rate")+
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'