This is R Markdown

You can create nice, publishable documents using R Markdown.

You should be comfortable using R Markdown for your R assignments.
You can find cheatsheets for R Markdown at https://rmarkdown.rstudio.com/lesson-1.html

Let’s begin the assignment.

Import the data.table and ggplot2 packages. If they are not already installed, install them first with install.packages('ggplot2') and install.packages('data.table').

library(data.table)
library(ggplot2)

Import the Binomial.csv data. There are many ways to import a .csv file, but fread() will import it as a data.table.

bi <- fread('C:/Users/linds/Documents/MSDS 600/Week 4/binomial.csv')

Calculate summary statistics and standard deviation. Plot the data as a scatter plot, histogram, and boxplot

summary(bi)
##        V1       
##  Min.   :57.00  
##  1st Qu.:67.00  
##  Median :70.00  
##  Mean   :70.17  
##  3rd Qu.:73.00  
##  Max.   :84.00
sd(bi$V1)
## [1] 4.689325
plot(bi$V1, xlab = 'bi Index', ylab = 'bi values', main = 'Scatter plot of the Binomial dataset')

hist(bi$V1, xlab = 'bi values', ylab = 'Frequency of bi values', main = 'Histogram of the Binomial dataset')

boxplot(bi, xlab = 'bi', ylab = 'bi values', main = 'Boxplot of the Binomial dataset')

Load the six datasets in the week 4 materials. Compute a set of descriptive statistics for each, including the mean, standard deviation, minimum, maximum, a histogram plot, and any other descriptive statistics you find meaningful. Include these in your write-up.
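Since the same statistics are computed for every dataset, a small helper function can save repetition. The following is only a sketch: `describe_vector` is a hypothetical name, not part of the assignment, and the data is simulated so the example runs without the week 4 files. With the real files, the input would be something like `fread('Ln.csv')$V1`.

```r
# Hypothetical helper (not part of the assignment): collect descriptive
# statistics for one numeric vector into a one-row data.frame.
describe_vector <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarising
  data.frame(
    n      = length(x),
    mean   = mean(x),
    sd     = sd(x),
    min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x),
    iqr    = IQR(x)
  )
}

# Simulated stand-in data (an assumption; replace with a real column).
set.seed(600)
demo <- rnorm(1000, mean = 10, sd = 2)

stats <- describe_vector(demo)
print(stats)
```

Calling the helper once per dataset and stacking the rows with `rbind()` gives a single comparison table for the write-up.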

Import the Ln.csv data. Calculate summary statistics and standard deviation. Plot the data as a scatter plot, histogram, and boxplot.

##        V1       
##  Min.   : 3.00  
##  1st Qu.:16.00  
##  Median :19.00  
##  Mean   :18.99  
##  3rd Qu.:22.00  
##  Max.   :43.00
## [1] 4.362612

Import the BN1.csv and BN2.csv data and calculate summary statistics, standard deviation and plot the histograms individually. Plot the BN1 and BN2 together on one graph as a scatter plot, and as boxplots.

##        V1        
##  Min.   : 1.781  
##  1st Qu.: 8.643  
##  Median : 9.993  
##  Mean   : 9.994  
##  3rd Qu.:11.343  
##  Max.   :18.612
##        V1        
##  Min.   : 6.638  
##  1st Qu.:10.321  
##  Median :10.998  
##  Mean   :10.997  
##  3rd Qu.:11.667  
##  Max.   :15.161
## [1] 2.000357
## [1] 0.9994714
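The code that produced the combined plots is not shown above, so here is one possible way to draw two vectors on a single scatter plot and as side-by-side boxplots in base R. This is a sketch under assumptions: `bn1_demo` and `bn2_demo` are simulated stand-ins whose means and standard deviations only loosely echo the summaries above, and the colours and titles are arbitrary choices.

```r
# Simulated stand-ins for the two datasets (assumption; with the real
# files these would be e.g. fread('BN1.csv')$V1 and fread('BN2.csv')$V1).
set.seed(4)
bn1_demo <- rnorm(500, mean = 10, sd = 2)
bn2_demo <- rnorm(500, mean = 11, sd = 1)

# Use a shared y-axis range so neither series is clipped.
y_range <- range(c(bn1_demo, bn2_demo))

# Scatter plot: draw the first series, then add the second with points().
plot(bn1_demo, col = "steelblue", pch = 16, ylim = y_range,
     xlab = "Index", ylab = "Value",
     main = "Scatter plot of BN1 and BN2 (simulated)")
points(bn2_demo, col = "tomato", pch = 16)
legend("topright", legend = c("BN1", "BN2"),
       col = c("steelblue", "tomato"), pch = 16)

# Boxplots: a named list gives one labelled box per dataset.
box_stats <- boxplot(list(BN1 = bn1_demo, BN2 = bn2_demo),
                     ylab = "Value",
                     main = "Boxplots of BN1 and BN2 (simulated)")
```

Putting both boxes on one axis makes the difference in spread visible at a glance, which is harder to judge from two separate plots with different scales.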

Import the N1.csv and N2.csv data, calculate summary statistics and standard deviation, and plot the histograms individually. Plot N1 and N2 together on one graph as a scatter plot, and as boxplots.

##        V1        
##  Min.   : 5.064  
##  1st Qu.: 8.774  
##  Median :10.093  
##  Mean   :10.134  
##  3rd Qu.:11.458  
##  Max.   :15.767
##        V1         
##  Min.   : 0.6503  
##  1st Qu.: 9.0030  
##  Median :11.9056  
##  Mean   :11.7082  
##  3rd Qu.:14.1246  
##  Max.   :22.6275
## [1] 1.995706
## [1] 4.291276

What, if anything, can be said about the differences between N1 & N2?

N2 has a larger range than N1: its minimum is 0.65 vs. 5.06 in N1, and its maximum is 22.63 vs. 15.77 in N1. N2 is also more spread out, with a standard deviation of 4.29 vs. 2.00 for N1. N2 has a larger median than N1 (11.91 vs. 10.09), so the middle of N2’s data sits about 2 values higher than N1’s. The main similarity between the two is that Q1 is nearly the same, rounding to 9 in each, which means the lowest 25% of each dataset falls at about 9 or below.

Similarly, what can be said about the differences between BN1 & BN2?

BN1 has a larger range than BN2: about 17 (min 1.78, max 18.61) vs. about 8.5 (min 6.64, max 15.16). BN1’s standard deviation is also about twice BN2’s (2.00 vs. 1.00). All of BN2’s data falls within the range of BN1, which can be seen clearly in the scatter plot; however, the centers still differ slightly, with BN2’s median around 11 vs. around 10 for BN1. The main similarity is that, within each dataset, the mean and median are nearly identical, which is consistent with both distributions being symmetrical.

What are the differences in the distributions, how do the outliers differ, what are the differences in the means, range, etc.?

BN1 and BN2 have slightly different means, about 10 for BN1 vs. about 11 for BN2. Within each dataset, however, the mean and median are nearly identical, which is consistent with symmetrical distributions that do not skew left or right. Both BN1 and BN2 have a fairly large number of outliers, but BN1’s outliers appear to be slightly more dispersed and extend farther from the bulk of the data than BN2’s.

N2 has a much larger range than N1, which means N1’s data is more concentrated. N1’s median is slightly lower than its mean, while N2’s median is higher than its mean; this suggests that N1 skews slightly right and N2 skews left. Neither dataset has many outliers; however, N2 has an outlier both above the upper whisker and below the lower whisker of its boxplot, while N1 only has an outlier above the upper whisker.
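The outlier and skew comparisons above are read off the plots, but they can also be checked numerically. A minimal sketch, assuming the usual boxplot rule (points more than 1.5 times the box height beyond the hinges count as outliers): `outlier_summary` is a hypothetical helper name, and the toy vector is made up purely for illustration.

```r
# Hypothetical helper: count boxplot outliers on each side and report
# mean minus median as a rough indicator of skew direction.
outlier_summary <- function(x) {
  bs <- boxplot.stats(x)        # default coef = 1.5, same rule as boxplot()
  lower_whisker <- bs$stats[1]  # end of the lower whisker
  upper_whisker <- bs$stats[5]  # end of the upper whisker
  list(
    n_low  = sum(bs$out < lower_whisker),
    n_high = sum(bs$out > upper_whisker),
    # positive suggests right skew, negative suggests left skew
    mean_minus_median = mean(x) - median(x)
  )
}

# Toy data (made up): the values 1 to 20 plus one large value.
toy <- c(1:20, 100)
res <- outlier_summary(toy)
print(res)
```

A positive `mean_minus_median` only hints at right skew; it is a quick check, not a formal skewness statistic.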

When considering the vectors individually, check if the distribution is normal or not. Which stats would you use to determine this?

To check if a distribution is normal, you can check whether 1) its histogram has a symmetrical bell shape, 2) the mean and median are approximately equal, and 3) about 68% of the data falls within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
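The 68/95/99.7 check in point 3 can be computed directly. The sketch below does that on simulated data; `normality_check` is a hypothetical helper name. It also reports a Shapiro-Wilk p-value and draws a Q-Q plot, which are additions beyond the answer above rather than something the assignment asks for.

```r
# Hypothetical helper: share of values within 1, 2 and 3 standard
# deviations of the mean, plus a Shapiro-Wilk test p-value.
normality_check <- function(x) {
  z <- (x - mean(x)) / sd(x)  # standardise to z-scores
  list(
    within_1sd = mean(abs(z) <= 1),  # roughly 0.68 if normal
    within_2sd = mean(abs(z) <= 2),  # roughly 0.95 if normal
    within_3sd = mean(abs(z) <= 3),  # roughly 0.997 if normal
    # shapiro.test() accepts between 3 and 5000 values
    shapiro_p  = shapiro.test(x)$p.value
  )
}

# Simulated normal data (assumption; replace with a real column).
set.seed(94)
demo_norm <- rnorm(1000)

check <- normality_check(demo_norm)
print(check)

# Q-Q plot: points close to the line are consistent with normality.
qqnorm(demo_norm)
qqline(demo_norm)
```

A small Shapiro-Wilk p-value is evidence against normality, while a large one only means the test found no evidence against it, so it is best read alongside the histogram and Q-Q plot.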