This assignment has four questions and it’s worth 10% of your final grade.
To assess your analytical and computing skills on the material covered.
Any code employed to complete this assignment must be self-explanatory and must be embedded in your answer using R Markdown code chunks. Screenshots and other images will not be allowed and will be penalized.
There is a lateness penalty of 5 marks/day (or part of a day thereof), up to a maximum of 4 days, unless the student gets an approved SCA no later than Monday 10 April 2023. See note below.
If you need an extension with no lateness penalty because, e.g., your performance has been impacted by extenuating, unexpected circumstances, you can submit an SCA along with relevant evidence using the submission link from the STAT500 Home page. If you have questions, contact or .
sid. For example, Jane Doe with the student id 123456789 will have
sid <- c(1,2,3,4,5,6,7,8,9)
sid <- c(2,1,1,5,3,6,9,5)
# Write your answer here
length(sid)
## [1] 8
# The length of variable 'sid' is 8
class(sid)
## [1] "numeric"
# The class of 'sid' is "numeric" because c() was given unquoted numbers, so R stores them as a numeric vector rather than as character strings
R functions, find mean, mode, median, standard deviation, and variance of this variable.
mean(sid)
## [1] 4
# The mean of variable 'sid' is 4
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.2.3
Mode(sid) # The modes are 1 and 5. The "freq" attribute below is 2, so each of them occurs twice in 'sid'
## [1] 1 5
## attr(,"freq")
## [1] 2
median(sid) # This gives a value of 4, the average of the two middle values (3 and 5) of the sorted data
## [1] 4
sd(sid) # This gives a value of 2.777, rounded to 3 decimal places
## [1] 2.77746
var(sid) # This gives a value of 7.714, rounded to 3 decimal places
## [1] 7.714286
The median is the middle value, or the average of the two middle values, when the data are sorted. Because ‘sid’ has 8 values, its two middle values are 3 and 5, and their average is 4, so the median is 4. The median is a better measure of central tendency than the mean when the data contain outliers or values unlike the rest, such as 6 and 9 in ‘sid’, because extreme values pull the mean towards them but barely affect the median. Such values can otherwise damage the reliability of the results.
The mode is the number or numbers that occur the most frequently. The variable ‘sid’ has 2 of these, those being 1 and 5. Both values occur twice in the variable while all the others occur once. This is useful in learning how the values are distributed and can show patterns or trends such as 1 and 5 being repeated.
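The definitions of the median and mode given above can be checked by hand in base R. The following is a minimal sketch, assuming the same `sid` vector; the names `sorted_sid`, `counts`, `manual_median` and `modes` are illustrative and not part of the assignment.

```r
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)

# Median: sort the values and average the two middle ones (n is even here)
sorted_sid <- sort(sid)
n <- length(sorted_sid)
manual_median <- mean(sorted_sid[c(n / 2, n / 2 + 1)])

# Mode: count how often each value occurs and keep the most frequent ones
counts <- table(sid)
modes <- as.numeric(names(counts)[counts == max(counts)])

manual_median  # agrees with median(sid)
modes          # the most frequent values in sid
```

Using `table()` avoids the extra package, at the cost of a little more code than `Mode()` from DescTools.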
Run the code below, with the sid variable defined as
above.
set.seed(sum(sid))
d1 <- data.frame(percent_cured = runif(100, min = 0, max = 100),
age = rep(c('child', 'adult'), each = 50))
R, Create and add a variable called dose to the data frame d1. Use c(100, 200, 100, 200) with every value repeated 25 times so that the length of dose matches the other two variables.
d1$dose <- rep(c(100, 200, 100, 200), each = 25)
# rep() with each = 25 repeats each of the values 100, 200, 100, 200 25 times in turn, giving 100 values, which matches the number of rows in d1
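A quick way to confirm that `each = 25` lines the doses up with the age groups is to cross-tabulate the two vectors. This is a sketch under the assumption that `age` is built exactly as in the data frame above; standalone vectors are used here so the block runs on its own.

```r
# Rebuild the two grouping vectors on their own (same calls as for d1)
age  <- rep(c("child", "adult"), each = 50)
dose <- rep(c(100, 200, 100, 200), each = 25)

length(dose)      # should equal the number of rows in d1
table(age, dose)  # counts for every age/dose combination
```

If `times = 25` were used instead of `each = 25`, the doses would alternate row by row rather than forming blocks of 25.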
R, find the summary statistics of percent_cured for child and adult.
summary(d1$percent_cured)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1632 25.8689 53.2639 52.0975 79.4956 99.7309
# This gives the minimum, first quartile, median, mean, third quartile and maximum of percent_cured, in that order, for all 100 rows together
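The question asks for the statistics of child and adult separately, whereas `summary(d1$percent_cured)` covers every row at once. One possible way to split the summary by age is `tapply()`. This is a sketch, not necessarily the intended method, and `by_age` is an illustrative name.

```r
# Recreate the data frame exactly as in the assignment
sid <- c(2, 1, 1, 5, 3, 6, 9, 5)
set.seed(sum(sid))
d1 <- data.frame(percent_cured = runif(100, min = 0, max = 100),
                 age = rep(c("child", "adult"), each = 50))

# Apply summary() to percent_cured within each age group
by_age <- tapply(d1$percent_cured, d1$age, summary)
by_age
```

The result holds one six-number summary per age group, which can then be compared directly.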
percent_cured for child and adult. You must consider your answer to (b).
# The data are very widely spread. The lowest value is 0.1632 and the highest is 99.7309, while the mean is 52.0975 and the median is 53.2639. This shows that the values are spread fairly evenly across the range from 0 to 100, which is what runif() produces (a uniform rather than a normal distribution). The minimum shows that at least one observation was almost not cured at all, while the maximum shows that at least one was almost completely cured. On average, percent_cured was around 50%.
percent_cured, (2) a box plot of percent_cured according to age, and (3) a box plot of percent_cured according to dose. What is the median of percent_cured according to dose?
hist(d1$percent_cured, main = "Histogram of Percent Cured")
# For 1) A histogram of Percent Cured
boxplot(percent_cured ~ age, data = d1, main = "Boxplot of Percent Cured by Age")
# For 2) A boxplot of percent cured by age
boxplot(percent_cured ~ dose, data = d1, main = "Boxplot of Percent Cured by Dose")
# For 3) A boxplot of percent cured by dose
aggregate(percent_cured ~ dose, d1, median)
## dose percent_cured
## 1 100 66.57221
## 2 200 45.73338
# Using the aggregate function, we can find the median of percent_cured for each dose: 66.57221 for a dose of 100 and 45.73338 for a dose of 200
sid variable in Question 1 and name it as the sid.norm1 variable and find its mean.
# Write your answer here
sid.norm1 is from a normal distribution?
# Write your answer here
sid.norm2 and sid.norm3.
# Write your answer here
# Write your answer here
First, complete the following tasks:
Go to the Statistics New Zealand COVID-19 data portal at https://www.stats.govt.nz/experimental/covid-19-data-portal and click on the orange “DOWNLOAD DATA” button (next to ABOUT) around the middle of the page.
Choose two indicators of your own choice. You can select them one at a time and download the data for each.
Once you successfully download each indicator, delete the “metadata” sheet. Then, save the file in an appropriate folder.
Import the dataset for each indicator into
R
# I have chosen the General Health and Smoking indicator and the General Health and Drinking indicator. Both datasets are imported into R.
# In the Health and Drinking dataset, the type is numeric because the values are percentages. 50% can also be written as 0.5, so all values are numeric. I chose this indicator because it relates to the world and to other people my age. New Zealand has a drinking culture, and I feel the results can help to visualise the problems people may have had during lockdown.
# In the Health and Smoking dataset, the type is also numeric. I chose this indicator for a similar reason: during lockdown many people needed a vice to escape the household, and I feel many people may have taken up smoking. The results could also help to visualise the problems associated with smoking and help to mitigate the damage.
# Write your answer here
# Write your answer here