Please return your solutions as an R script file named as follows: FirstName_LastName_studentID.R
Write your explanations as comments in your code.
Please send your R script file to yasinkaymaz@gmail.com with FirstName_LastName_CBDS in the subject line.
The deadline is 20 January 2023 at 11:55 am (before noon!). Submissions received after the deadline will be failed!
Here is the link to DATASETS. Please right-click the link and open it in a new browser tab.
A researcher is interested in the effectiveness of a new drug for treating a certain condition. The researcher randomly assigns 20 patients to a treatment group, who are given the new drug, and 20 patients to a control group, who are given a placebo. The researcher measures the severity of the condition for each patient before and after the treatment. The data for the treatment group is stored in the variable treatment, and the data for the control group is stored in the variable control. Use an appropriate statistical test to determine if the mean change in the severity of the condition is greater for the treatment group than for the control group.
Data: “Question1.data.rds”
#Please, load your data as below.
data1 <- readRDS("Question1.data.rds")
head(data1)
## Patient Change_in_severity Condition
## 1 1 135.7907 treatment
## 2 2 104.2623 treatment
## 3 3 101.0977 treatment
## 4 4 168.6156 treatment
## 5 5 113.7471 treatment
## 6 6 124.4734 treatment
#YOUR CODE SHOULD BE HERE!
#...
You could also consider doing a Wilcoxon rank sum test (also known as the Mann-Whitney test) which is a non-parametric test that can be used in place of a t-test when the data is not normally distributed.
#YOUR CODE SHOULD BE HERE!
#...
Make sure to also interpret the test statistic (for example, the t-value) and the p-value returned by each test in light of your null hypothesis and your chosen significance level. Based on that interpretation, state whether or not the drug was successful.
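As a rough illustration of the one-sided comparison described above, the sketch below runs Welch's t-test and the Wilcoxon rank sum test on simulated numbers. The data frame sim, its values, and the 0.05 significance level are assumptions made for illustration, not the contents of Question1.data.rds. It also assumes that a larger Change_in_severity means a larger improvement.

```r
# Simulated stand-in for data1 (NOT the course dataset); all values are arbitrary.
set.seed(1)
sim <- data.frame(
  Patient = 1:40,
  Change_in_severity = c(rnorm(20, mean = 120, sd = 25),
                         rnorm(20, mean = 100, sd = 25)),
  Condition = rep(c("treatment", "control"), each = 20)
)

treat <- sim$Change_in_severity[sim$Condition == "treatment"]
ctrl  <- sim$Change_in_severity[sim$Condition == "control"]

# H0: mean change (treatment) <= mean change (control)
# H1: mean change (treatment) >  mean change (control)
tt <- t.test(treat, ctrl, alternative = "greater")       # Welch test by default
wt <- wilcox.test(treat, ctrl, alternative = "greater")  # non-parametric alternative

# Shapiro-Wilk normality check per group, to help choose between the two tests
sw_p <- tapply(sim$Change_in_severity, sim$Condition,
               function(v) shapiro.test(v)$p.value)

alpha <- 0.05                     # assumed significance level
reject_h0 <- tt$p.value < alpha   # TRUE means H0 is rejected at level alpha

print(tt)
print(wt)
print(sw_p)
```

Passing the two groups as separate vectors keeps the direction of alternative = "greater" explicit; with the formula interface the direction depends on the alphabetical order of the factor levels.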
A study was conducted to investigate the relationship between the average number of hours studied per week and the final exam scores of students. The data for this study is stored in a dataframe called study_data which has two columns: hours_studied and final_score. Use the Pearson correlation coefficient to determine the correlation between the number of hours studied and the final exam scores. Interpret the outcome in terms of positive, negative or no correlation. What does it mean to you? It would also be useful to visualize the relationship in the form of a scatterplot, so use a graphical representation of the data to support your point.
Data: “Question2.data.rds”
#Please, load your data as below.
study_data <- readRDS("Question2.data.rds")
head(study_data)
## hours_studied final_score
## 1 2.716028 6.534859
## 2 2.689233 29.611794
## 3 3.196513 29.987681
## 4 2.457976 14.240253
## 5 2.174957 15.511948
## 6 3.260606 26.967631
#YOUR CODE SHOULD BE HERE!
#...
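One possible shape for this analysis, sketched on simulated data rather than on Question2.data.rds: cor.test() returns the Pearson coefficient together with a p-value and confidence interval, and a base-graphics scatterplot shows the trend. The data frame sim_study and the numbers used to generate it are assumptions.

```r
# Simulated stand-in for study_data (NOT the course dataset).
set.seed(2)
hours <- runif(50, min = 1, max = 5)
sim_study <- data.frame(
  hours_studied = hours,
  final_score   = 5 + 6 * hours + rnorm(50, sd = 5)
)

# Pearson correlation with a test of H0: true correlation = 0
ct <- cor.test(sim_study$hours_studied, sim_study$final_score,
               method = "pearson")
r <- unname(ct$estimate)   # r > 0: positive, r < 0: negative, r near 0: none
print(ct)

# Scatterplot with a fitted straight line as a visual guide
plot(sim_study$hours_studied, sim_study$final_score,
     xlab = "Hours studied per week", ylab = "Final exam score",
     main = sprintf("Pearson r = %.2f", r))
abline(lm(final_score ~ hours_studied, data = sim_study), lty = 2)
```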
A study was conducted to investigate the effectiveness of a new teaching method for a certain subject. The study consisted of 30 students who were taught the subject using the traditional method for one month, and then taught the same subject using the new method for another month. The study collected the final test scores for each student for both the traditional and new teaching method. The data for this study is stored in a dataframe called study_data which has two columns: traditional_score and new_method_score. Use a paired t-test to determine if the mean final test scores are significantly different between the traditional and new teaching methods.
Data: “Question3.data.rds”
#Please, load your data as below.
study_data <- readRDS("Question3.data.rds")
head(study_data)
## traditional_score new_method_score
## 1 75.24097 71.76738
## 2 47.39232 61.38946
## 3 69.80320 81.07347
## 4 71.83140 80.84169
## 5 66.38649 67.27816
## 6 65.15516 63.93000
#YOUR CODE SHOULD BE HERE!
#...
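A minimal sketch of a paired t-test, assuming simulated scores in place of Question3.data.rds. The data frame sim_scores and its values are hypothetical. The second call shows that the paired test is the same as a one-sample t-test on the per-student differences.

```r
# Simulated stand-in for study_data (NOT the course dataset).
set.seed(3)
trad <- rnorm(30, mean = 68, sd = 8)
sim_scores <- data.frame(
  traditional_score = trad,
  new_method_score  = trad + rnorm(30, mean = 3, sd = 6)
)

# H0: mean of (new_method_score - traditional_score) = 0, two-sided
pt <- t.test(sim_scores$new_method_score, sim_scores$traditional_score,
             paired = TRUE)
print(pt)

# Equivalent formulation: one-sample t-test on the differences
d   <- sim_scores$new_method_score - sim_scores$traditional_score
pt2 <- t.test(d, mu = 0)
```

Pairing matters here because each student contributes both scores, so the two columns are not independent samples.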
A researcher is interested in investigating the relationship between the number of hours of exercise per week and the body mass index (BMI) of a sample of people. The data for this study is stored in a dataframe called exercise_data which has two columns: hours_exercise and bmi. Use linear regression to model the relationship between the number of hours of exercise per week and the BMI of the sample. Provide the equation that defines the linear relationship and state whether or not the relationship is statistically significant. How does BMI change as hours_exercise increases? Explain further.
Data: “Question4.data.rds”
#Please, load your data as below.
exercise_data <- readRDS("Question4.data.rds")
head(exercise_data)
## hours_exercise bmi
## 1 2.439524 22.60884
## 2 2.769823 25.55503
## 3 4.558708 24.12318
## 4 3.070508 23.78021
## 5 3.129288 22.01586
## 6 4.715065 24.73873
#YOUR CODE SHOULD BE HERE!
#...
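The following sketch shows how a fitted line and its significance might be read off an lm() fit. It uses simulated data, not Question4.data.rds; sim_ex and the generating coefficients are assumptions chosen only so the example runs.

```r
# Simulated stand-in for exercise_data (NOT the course dataset).
set.seed(4)
h <- rnorm(100, mean = 3, sd = 1)
sim_ex <- data.frame(
  hours_exercise = h,
  bmi            = 26 - 0.8 * h + rnorm(100, sd = 1.5)
)

fit <- lm(bmi ~ hours_exercise, data = sim_ex)

# Fitted equation: bmi = intercept + slope * hours_exercise
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["hours_exercise"]]
cat(sprintf("bmi = %.3f + (%.3f) * hours_exercise\n", b0, b1))

# p-value for H0: slope = 0, plus R-squared as a measure of fit
slope_p <- summary(fit)$coefficients["hours_exercise", "Pr(>|t|)"]
r2      <- summary(fit)$r.squared
print(summary(fit))
```

The slope b1 is the estimated change in BMI for one additional hour of exercise per week; its sign gives the direction of the relationship.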
A study was conducted to investigate the effect of different fertilizers on plant growth. Four different types of fertilizers were used: A, B, C and D. Each fertilizer was applied to a group of 10 randomly selected plants, and the height of each plant was measured after one month. The data for this study is stored in a data frame called plant_data that includes three columns: fertilizer (a factor variable with levels A, B, C and D), plant (an identifier for each plant) and height (the measured height of each plant). Use an ANOVA to determine if there is a significant difference in the mean height of plants grown with the different fertilizers.
Data: “Question5.data.rds”
#Please, load your data as below.
plant_data <- readRDS("Question5.data.rds")
head(plant_data)
## fertilizer plant height
## 1 A A1 9.312957
## 2 A A2 13.243551
## 3 A A3 13.601749
## 4 A A4 9.222215
## 5 A A5 10.571286
## 6 A A6 11.351878
#YOUR CODE SHOULD BE HERE!
#...
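A one-way ANOVA along these lines could look like the sketch below, which uses simulated heights rather than Question5.data.rds. The data frame sim_plants and its group means are assumptions, and the Tukey HSD follow-up is an optional extra that the question does not ask for.

```r
# Simulated stand-in for plant_data (NOT the course dataset).
set.seed(5)
sim_plants <- data.frame(
  fertilizer = factor(rep(c("A", "B", "C", "D"), each = 10)),
  plant      = paste0(rep(c("A", "B", "C", "D"), each = 10), 1:10),
  height     = c(rnorm(10, 11, 2), rnorm(10, 13, 2),
                 rnorm(10, 12, 2), rnorm(10, 15, 2))
)

# H0: all four fertilizer groups have the same mean height
aov_fit <- aov(height ~ fertilizer, data = sim_plants)
aov_tab <- summary(aov_fit)[[1]]      # ANOVA table as a data frame
anova_p <- aov_tab[["Pr(>F)"]][1]     # p-value for the fertilizer effect
print(aov_tab)

# Optional follow-up: which pairs of fertilizers differ?
tukey <- TukeyHSD(aov_fit)
print(tukey)

# Quick look at the equal-variance assumption
print(bartlett.test(height ~ fertilizer, data = sim_plants))
```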
A market research company wants to segment its customers based on their age, income and spending habits. The company has collected data on 1000 customers, which is stored in a data frame called customer_data that includes three columns: age, income, and spending. Use k-means clustering to segment the customers into 4 clusters based on their age, income and spending.
Data: “BonusQuestion-I.data.rds”
#Please, load your data as below.
customer_data <- readRDS("BonusQuestion-I.data.rds")
head(customer_data)
## age income spending cluster
## 1 28 37947 551 4
## 2 43 53015 1003 2
## 3 51 34609 979 3
## 4 17 56354 1652 1
## 5 44 57030 229 4
## 6 45 30941 508 3
#YOUR CODE SHOULD BE HERE!
#...
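A hedged sketch of k-means segmentation on simulated customers, not on BonusQuestion-I.data.rds. The data frame sim_cust, the column name segment, and the choice to standardize the variables first are assumptions. Note that the head() output above shows a cluster column in the provided data; this sketch would leave such a column out of the clustering input.

```r
# Simulated stand-in for customer_data (NOT the course dataset).
set.seed(6)
sim_cust <- data.frame(
  age      = sample(18:70, 200, replace = TRUE),
  income   = round(runif(200, 20000, 80000)),
  spending = round(runif(200, 100, 2000))
)

# Standardize so that income (large numbers) does not dominate the distances
features <- scale(sim_cust[, c("age", "income", "spending")])

# k-means with 4 clusters; nstart > 1 reruns from several random starts
km <- kmeans(features, centers = 4, nstart = 25)
sim_cust$segment <- km$cluster

# Describe each segment by its average age, income and spending
print(table(sim_cust$segment))
print(aggregate(cbind(age, income, spending) ~ segment,
                data = sim_cust, FUN = mean))
```

Because k-means starts from random centers, setting a seed makes the cluster labels reproducible between runs.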
A study was conducted to investigate if there is a difference in conversion rate between two website designs. The study consisted of showing the two designs to different groups of website visitors and measuring the proportion of visitors who completed a purchase. The data for this study is stored in a data frame called conversion_data that includes two columns: design (a factor variable with levels A and B), and converted (a binary variable indicating whether the visitor completed a purchase or not). Use a permutation test to determine if the proportion of converted visitors is different between the two website designs.
Data: “BonusQuestion-II.data.rds”
#Please, load your data as below.
conversion_data <- readRDS("BonusQuestion-II.data.rds")
head(conversion_data)
## design converted
## 1 A 0
## 2 A 0
## 3 A 0
## 4 A 0
## 5 A 0
## 6 A 0
#YOUR CODE SHOULD BE HERE!
#...
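The permutation idea can be sketched as follows: shuffle the design labels many times, recompute the difference in conversion proportions each time, and compare the observed difference with that null distribution. The data frame sim_conv, the helper diff_prop, and the 5000 permutations are assumptions for illustration; this does not use BonusQuestion-II.data.rds.

```r
# Simulated stand-in for conversion_data (NOT the course dataset).
set.seed(7)
sim_conv <- data.frame(
  design    = factor(rep(c("A", "B"), each = 500)),
  converted = c(rbinom(500, 1, 0.10), rbinom(500, 1, 0.14))
)

# Hypothetical helper: conversion proportion of B minus that of A
diff_prop <- function(converted, design) {
  mean(converted[design == "B"]) - mean(converted[design == "A"])
}

observed <- diff_prop(sim_conv$converted, sim_conv$design)

# Null distribution: shuffle the labels, which breaks any design effect
n_perm <- 5000
perm <- replicate(n_perm,
                  diff_prop(sim_conv$converted, sample(sim_conv$design)))

# Two-sided p-value; the +1 terms count the observed labeling itself
p_value <- (sum(abs(perm) >= abs(observed)) + 1) / (n_perm + 1)

cat(sprintf("observed difference = %.4f, permutation p-value = %.4f\n",
            observed, p_value))
hist(perm, main = "Permutation null distribution",
     xlab = "Difference in proportions (B - A)")
abline(v = observed, lty = 2)
```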