Data Visualization Assignment

My project studies the effectiveness of Botox treatment for non-cosmetic conditions. The dataset comes from an experimental study of how effective Botox treatment is at treating headaches. It contains 7 variables. First, let’s import the data and remove the rows with missing values.

library(readxl)
Experiment_Info <- read_excel("C:/Users/yanru/Desktop/R File/Experiment Info.xlsx")
View(Experiment_Info)

str(Experiment_Info)

## Classes 'tbl_df', 'tbl' and 'data.frame':    50 obs. of  7 variables:
##  $ Group   : chr  "Study 1" "Study 1" "Study 1" "Study 1" ...
##  $ Area    : chr  "Frontalis" "Corrugator" "Procerus" "Occipitalis" ...
##  $ Units   : chr  "20" "10" "5" "30" ...
##  $ Sites   : chr  "4" "2" "1" "6" ...
##  $ Level   : chr  "Medium" "Medium" "Low" "Medium" ...
##  $ Patients: chr  "83" "77" "52" "50" ...
##  $ Days    : num  5.9 6.2 4.8 6.8 6.5 7 5.1 6.7 6.8 6.2 ...
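
The str() output above shows Units, Sites and Patients stored as character (chr) columns even though they hold numbers. A minimal sketch of converting such columns to numeric follows. The data frame toy is hypothetical and is built from the first four values shown in the output above, not read from the spreadsheet.

```r
# Toy data frame (for illustration only) with numbers stored as text
toy <- data.frame(
  Sites = c("4", "2", "1", "6"),
  Units = c("20", "10", "5", "30"),
  Days  = c(5.9, 6.2, 4.8, 6.8),
  stringsAsFactors = FALSE
)

# Convert the text columns to numeric. Entries that are not valid
# numbers become NA (with a warning), which also exposes missing values.
num_cols <- c("Sites", "Units")
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

str(toy)
```

Working with numeric columns makes it explicit that plots and summaries treat Sites as a number rather than as text.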

tail(Experiment_Info)

## # A tibble: 6 x 7
##   Group   Area              Units Sites Level  Patients  Days
##   <chr>   <chr>             <chr> <chr> <chr>  <chr>    <dbl>
## 1 Study 2 Pectineus         20    4     Medium 14         6.6
## 2 Study 2 Sartorius         5     4     Low    19         7.1
## 3 Study 2 Lliopsoas         5     1     Low    24         6.9
## 4 Study 2 Pectineus         20    4     Medium 37         6.1
## 5 Study 2 Tibialis Anterior 15    1     Low    39         5.8
## 6 Study 2 Rectus Femoris    30    6     Medium 15         6.2

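# Drop the rows that contain missing values, one at a time.
# Each index refers to the data frame produced by the previous step.
Experiment_Info1 <- Experiment_Info[-5, ]
Experiment_Info2 <- Experiment_Info1[-8, ]
Experiment_Info3 <- Experiment_Info2[-11, ]
Experiment_Info4 <- Experiment_Info3[-26, ]
Experiment_Info5 <- Experiment_Info4[-28, ]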
tail(Experiment_Info5)

## # A tibble: 6 x 7
##   Group   Area              Units Sites Level  Patients  Days
##   <chr>   <chr>             <chr> <chr> <chr>  <chr>    <dbl>
## 1 Study 2 Pectineus         20    4     Medium 14         6.6
## 2 Study 2 Sartorius         5     4     Low    19         7.1
## 3 Study 2 Lliopsoas         5     1     Low    24         6.9
## 4 Study 2 Pectineus         20    4     Medium 37         6.1
## 5 Study 2 Tibialis Anterior 15    1     Low    39         5.8
## 6 Study 2 Rectus Femoris    30    6     Medium 15         6.2

View(Experiment_Info5)

The dataset Experiment Info contains missing values, so I removed the affected rows to obtain a dataset without missing values.
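
Removing rows by position works, but the indices have to be found by hand and they shift after each removal. As an alternative sketch, base R's complete.cases() can drop every row that contains an NA in one step. The data frame toy below is made up for illustration, and the sketch assumes that missing cells were read in as NA, which may not match how they appear in the actual spreadsheet.

```r
# Toy data frame (made up) with a few missing cells
toy <- data.frame(
  Area  = c("Frontalis", "Corrugator", NA, "Occipitalis"),
  Sites = c(4, NA, 1, 6),
  Days  = c(5.9, 6.2, 4.8, 6.8),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
colSums(is.na(toy))

# Keep only the rows that have no missing value in any column
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```

This approach does not depend on row order, so it keeps working if the spreadsheet is edited or re-sorted.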

My Y variable is the number of headache days reduced (Days).

hist(Experiment_Info5$Days, xlab = "Number of Days Headache Reduced", main = "Headache Days Reduced After Treatment")

library(e1071)

## Warning: package 'e1071' was built under R version 3.6.3

skewness(Experiment_Info5$Days)

## [1] -0.5360097

kurtosis(Experiment_Info5$Days)

## [1] -0.2362777

From the histogram, we can see that reductions between 6.5 and 7 days occurred more than 12 times, reductions between 6 and 6.5 days occurred 10 times, and reductions between 5.5 and 6 days occurred 12 times, while reductions between 5 and 5.5 days occurred only once. Most often, the Botox treatment reduced headache days by 6.5 to 7 days or by 5.5 to 6 days. The skewness is -0.54, which means the distribution has a longer tail on the left side; in such a distribution the mean is typically less than the median, and the median less than the mode. Most observations fall in the upper values (right side) and few in the lower values (left side). The kurtosis is -0.24. The kurtosis() function in e1071 reports excess kurtosis, for which a normal distribution scores 0, so a negative value means the distribution is flatter than a normal distribution and has lighter tails: extreme values in my data are less frequent than would be expected under a normal distribution.
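
To make the two statistics concrete, here is a minimal sketch of the moment-based definitions of skewness and excess kurtosis in base R. The helper names central_moment, skew_g1 and kurt_g2 are hypothetical, and the input vector is made up. e1071 offers several estimator variants through its type argument, and its default applies a small-sample scaling factor, so its results will not match these formulas exactly on small samples.

```r
# Central moment of order k (divides by n, not n - 1)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance to the power 3/2
skew_g1 <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Excess kurtosis: scaled fourth central moment minus 3,
# so that a normal distribution scores 0
kurt_g2 <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

x <- c(1, 2, 3, 4, 5)  # made-up, symmetric sample
skew_g1(x)  # 0, because the sample is symmetric
kurt_g2(x)  # -1.3, flatter than a normal distribution
```

A negative skewness comes from large negative deviations dominating the third moment, which is the long left tail described above.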

My X variable is the number of Botox injection sites in each muscle (Sites).

boxplot(Experiment_Info5$Days~Experiment_Info5$Sites, ylab = "Number of Days Reduced", xlab= "Sites", main= "Number of Days Reduced of Different Sites Injected")

In the boxplot, we can see six different values for the number of injection sites. The 5-site and 100-site groups each contain only one observation, so each appears as a single line. The median is 6.1 for 1 site, 6.2 for 2 sites, 6.1 for 4 sites and 6.9 for 6 sites. The minimum is 4.6 for 1 site, 4.7 for 2 sites and 6.2 for 6 sites. The maximum is 7.5 for 1 site, 7 for 2 sites, 7.3 for 4 sites and 7.3 for 6 sites. The 25% mark (first quartile) is 5.6, 5.9, 5.9 and 6.3 for the 1-, 2-, 4- and 6-site groups, respectively, and the 75% mark (third quartile) is 6.8, 6.8, 6.9 and 7.1. I didn’t see any outliers. The 6-site distribution is negatively skewed, since its median is closer to the top of the box. The 1-, 2- and 4-site distributions are positively skewed, since their medians are closer to the bottom of the box. The 1-site distribution is more dispersed than the 2-, 4- and 6-site distributions, since it has the tallest box (the largest interquartile range). The medians are all close to each other except for the 6-site group, whose median sits near the 75% mark of the other three groups (1, 2 and 4 sites), which suggests that it differs noticeably from them.
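
The values read off the boxplot can also be computed directly. As a minimal sketch: boxplot() builds its boxes from the five-number summary that fivenum() returns, so applying fivenum() to each group gives the minimum, lower hinge, median, upper hinge and maximum. When a group has outliers, the whiskers stop short of the minimum or maximum. The data frame toy is hypothetical, with made-up numbers chosen to resemble the summary above; it is not the experiment data.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  Sites = c(1, 1, 1, 1, 1, 6, 6, 6),
  Days  = c(4.6, 5.6, 6.1, 6.8, 7.5, 6.2, 6.9, 7.3)
)

# Five-number summary of Days within each Sites group:
# minimum, lower hinge, median, upper hinge, maximum
stats <- tapply(toy$Days, toy$Sites, fivenum)

stats[["1"]]
stats[["6"]]
```

Computing the summaries avoids reading values off the axis by eye, and it makes it easy to report the same statistic for every group.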

Both variables are numeric, so I will create a scatter plot.

plot(Experiment_Info5$Sites,Experiment_Info5$Days, xlab = "Sites", ylab = "Days Headache Reduced", main = "Days Headache Reduced for Different Sites Injected")

The observation with 100 injection sites looks like an outlier. Let’s remove it temporarily to get a better view of the plot.

Experiment_Info6=Experiment_Info5[-21,]
plot(Experiment_Info6$Sites,Experiment_Info6$Days, xlab = "Sites", ylab = "Days Headache Reduced", main = "Days Headache Reduced for Different Sites Injected")
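
Dropping row 21 depends on the current row order. A sketch of the same idea expressed by value instead, on a made-up data frame (toy is hypothetical), assuming Sites is stored as a numeric column:

```r
# Toy data (made up) containing one row with 100 sites
toy <- data.frame(
  Sites = c(1, 2, 100, 4, 6),
  Days  = c(5.9, 6.2, 6.5, 6.8, 7.0)
)

# Keep every row except the one with 100 sites
toy_no_outlier <- toy[toy$Sites != 100, ]
toy_no_outlier
```

Filtering by value states the reason for the removal in the code itself and keeps working if rows are added or reordered.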

With the outlier removed, the plot is easier to read. I don’t see an obvious trend, so the number of sites doesn’t appear to have a big impact on the number of headache days reduced. Most of the patients in the experiment were injected at 1 site of the muscle, yet the result for 1-site injections varies from 4.6 to 7.5 days reduced. Other factors are therefore likely to influence the result. I will add one more variable to the plot to see whether a trend appears.

library(ggplot2)
ggplot(Experiment_Info6,aes(x=Sites, y=Days, shape=Group, color=Group))+geom_point()

The red circles represent Study 1 and the blue triangles represent Study 2. Study 1 appears to show a positive linear relationship between the number of sites injected (independent variable) and headache days reduced (dependent variable). I didn’t see an obvious relationship between days reduced and sites injected in Study 2. Collecting more data might clarify this, or another variable may be influencing the result in Study 2. No patients in Study 1 were injected at 5 sites. In Study 2, more patients were injected at 1 site of the muscle. There are no outliers in this plot: all the data fall between 4.5 and 7.5. Study 2 seems slightly more effective than Study 1, since most of its points sit slightly above those of Study 1.
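
A visual impression of a trend can be checked numerically. As a sketch, the Pearson correlation between Sites and Days can be computed separately for each study: values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. The data frame toy is hypothetical; its numbers are made up so that one group is exactly linear, and they say nothing about the real experiment.

```r
# Toy data (made up): Study 1 is exactly linear, Study 2 is not
toy <- data.frame(
  Group = rep(c("Study 1", "Study 2"), each = 4),
  Sites = c(1, 2, 4, 6, 1, 2, 4, 6),
  Days  = c(5.0, 5.5, 6.5, 7.5, 6.5, 6.9, 6.4, 6.8)
)

# Correlation between Sites and Days within each study
cors <- sapply(split(toy, toy$Group), function(d) cor(d$Sites, d$Days))
cors
```

With only a handful of points per group, a correlation is a rough guide rather than evidence of an effect, which is consistent with the suggestion above to collect more data.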

Rui Yan

2020/4/11