Intro to Data Science - HW 4

Copyright 2021, Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Joseph Silvestri

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

(Chapters 8, 9, and 10 of Introduction to Data Science)

Reminders of things to practice from previous weeks:
Descriptive statistics: mean( ) max( ) min( )
Sequence operator: : (For example, 1:4 is shorthand for 1, 2, 3, 4)
Create a function: myFunc <- function(myArg) { }
?command: Ask R for help with a command

This module: Sampling is a process of drawing elements from a larger set. In data science, when analysts work with data, they often work with a sample of the data, rather than all of the data (which we call the population), because of the expense of obtaining all of the data.

One must be careful, however, because statistics from a sample rarely match the characteristics of the population. The goal of this homework is to sample from a data set several times and explore the meaning of the results. Before you get started make sure to read Chapters 8-10 of An Introduction to Data Science. Don’t forget your comments!

Part 1: Write a function to compute statistics for a vector of numeric values

Create a new function which takes a numeric vector as its input argument and returns a dataframe of statistics about that vector as the output. As a start, the dataframe should have the min, mean, and max of the vector. The function should be called vectorStats:

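vectorStats <- function(vector) {
  # Compute the summary statistics for the numeric input vector
  Min1 <- min(vector)
  Mean1 <- mean(vector)
  Max1 <- max(vector)
  # Print each statistic on its own line
  cat("Minimum is :", Min1, "\n")
  cat("Mean is: ", Mean1, "\n")
  cat("Maximum is :", Max1, "\n")
  # Return the statistics as a dataframe (invisibly, so only the cat() lines print)
  invisible(data.frame(min = Min1, mean = Mean1, max = Max1))
}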

Test your function by calling it with the numbers one through ten:

vector<-(1:10)
vectorStats(vector)

## Minimum is : 1 
## Mean is:  5.5 
## Maximum is : 10

Enhance the vectorStats() function to add the median and standard deviation to the returned dataframe.

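vectorStats <- function(vector) {
  # Compute the summary statistics for the numeric input vector
  Min1 <- min(vector)
  Mean1 <- mean(vector)
  Med1 <- median(vector)
  SD1 <- sd(vector)
  Max1 <- max(vector)
  # Print each statistic on its own line
  cat("Minimum is :", Min1, "\n")
  cat("Mean is: ", Mean1, "\n")
  cat("Maximum is :", Max1, "\n")
  cat("Median is :", Med1, "\n")
  cat("Standard deviation is :", SD1, "\n")
  # Return the statistics as a dataframe (invisibly, so only the cat() lines print)
  invisible(data.frame(min = Min1, mean = Mean1, median = Med1, sd = SD1, max = Max1))
}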

Retest your enhanced function by calling it with the numbers one through ten:

vector<-(1:10)
vectorStats(vector)

## Minimum is : 1 
## Mean is:  5.5 
## Maximum is : 10 
## Median is : 5.5 
## Standard deviation is : 3.02765
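
The task asks for the statistics to come back as a dataframe. A minimal sketch of one way to do that is shown here; the function name vectorStatsDF and the column names are illustrative assumptions, not part of the assignment.

```r
# Hypothetical variant: return the statistics as a one-row dataframe
vectorStatsDF <- function(x) {
  data.frame(
    min = min(x),
    mean = mean(x),
    median = median(x),
    sd = sd(x),
    max = max(x)
  )
}

# Capture the result so individual statistics can be reused later
statsDF <- vectorStatsDF(1:10)
print(statsDF)
statsDF$mean  # a single statistic can be pulled out by column name
```

Returning a dataframe instead of only printing makes the statistics available to later code, for example to compare a sample against the full data set.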

Part 2: Sample repeatedly from the mtcars built-in dataframe

Copy the mtcars dataframe:

myCars <- mtcars

Use head(myCars) and tail(myCars) to show the data. Add a comment that describes what each variable in the data set contains.
Hint: Use the ? or help( ) command with mtcars to get help on this dataset.

head(myCars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(myCars)

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

help(mtcars)

## starting httpd help server ... done

#mpg- Miles per (US) gallon
#cyl- Number of cylinders
#disp- Displacement (cubic inches)
#hp- Gross engine horsepower
#drat- Rear axle ratio
#wt- Weight (in thousands of pounds)
#qsec- Quarter mile time (seconds)
#vs- Engine shape: 0 = V-shaped, 1 = straight
#am- Transmission: 0 = automatic, 1 = manual
#gear- Number of forward gears
#carb- Number of carburetors

Sample three observations from myCars$mpg.

sample(myCars$mpg,3,replace = FALSE)

## [1] 33.9 22.8 19.2
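
Because sample() draws at random, the three values will change on every run. If a reproducible draw is wanted, one option is to call set.seed() first. This is an assumption for illustration only, since the assignment does not ask for it, and the seed value 123 is arbitrary.

```r
myCars <- mtcars

# Setting the same seed before each draw makes the random draw repeatable
set.seed(123)
firstDraw <- sample(myCars$mpg, 3, replace = FALSE)

set.seed(123)
secondDraw <- sample(myCars$mpg, 3, replace = FALSE)

# Same seed, same call: the two draws contain the same three values
identical(firstDraw, secondDraw)
```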

Call your vectorStats( ) function with a new sample of three observations from myCars$mpg, where the sampling is done inside the vectorStats function call. Then use the mean function, with another sample done inside the mean function. Is the mean returned from the vectorStats function from the first sample the same as the mean returned from the mean function on the second sample? Why or Why not?

vector<-(myCars$mpg)
vectorStats(vector)

## Minimum is : 10.4 
## Mean is:  20.09062 
## Maximum is : 33.9 
## Median is : 19.2 
## Standard deviation is : 6.026948

mean(sample(myCars$mpg,3, replace = FALSE))

## [1] 18

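#The means are different: vectorStats() was given all 32 mpg values (mean 20.09), while mean() was given a random sample of only three values (mean 18). A sample that small is unlikely to be representative of the population, and two separate random samples of three would usually differ from each other as well.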
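
The question asks for the sampling to happen inside each function call. A minimal sketch of that pattern follows. It calls mean() directly rather than vectorStats() so that it runs on its own, and the variable names are illustrative.

```r
myCars <- mtcars

# Each call to sample() draws a fresh set of three mpg values,
# so the two means come from two different samples
firstMean <- mean(sample(myCars$mpg, 3, replace = FALSE))
secondMean <- mean(sample(myCars$mpg, 3, replace = FALSE))

# Mean of all 32 observations, for comparison
populationMean <- mean(myCars$mpg)

cat("First sample mean: ", firstMean, "\n")
cat("Second sample mean:", secondMean, "\n")
cat("Population mean:   ", populationMean, "\n")
```

The same pattern applies to the homework function: vectorStats(sample(myCars$mpg, 3, replace = FALSE)) would report statistics for a new three-value sample on each call.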

Use the replicate( ) function to repeat your sampling of mtcars ten times, with each sample calling mean() on three observations. The first argument to replicate( ) is the number of repeats you want. The second argument is the little chunk of code you want repeated.

replicate(10,mean(sample(mtcars$mpg, size=3)),simplify = TRUE)

##  [1] 16.36667 17.30000 24.86667 18.06667 22.36667 23.56667 21.10000 20.73333
##  [9] 18.16667 20.60000

Write a comment describing why every replication produces a different result.

#Each replication draws a new random sample of three cars from the population, so each mean is computed from a different set of mpg values and the results vary from one replication to the next.
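
To illustrate why the results differ, the sketch below contrasts replicate(), which re-runs the sampling code on every repeat, with rep(), which evaluates the code once and copies the single result. The variable names are illustrative.

```r
# replicate() re-evaluates the expression each time: ten separate samples
replicated <- replicate(10, mean(sample(mtcars$mpg, size = 3)))

# rep() evaluates the expression once and repeats that one value ten times
repeated <- rep(mean(sample(mtcars$mpg, size = 3)), 10)

# Count the distinct means produced by each approach
length(unique(replicated))
length(unique(repeated))
```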

Rerun your replication, this time doing 1000 replications and storing the output of replicate() in a variable called values.

values<-replicate(1000,mean(sample(mtcars$mpg, size=3)),simplify = TRUE)

Generate a histogram of the means stored in values.

jpeg(filename = "MyCarsMPG.jpeg")
hist(values, main = "My Cars MPG", prob = TRUE, breaks = 50)
dev.off() # close the jpeg device so the file is written to disk

Repeat the replicated sampling, but this time, raise your sample size from 3 to 22.

values22 <- replicate(1000, mean(sample(mtcars$mpg, size = 22)), simplify = TRUE)
jpeg(filename = "MyCarsMPGSize22.jpeg")
hist(values22, main = "My Cars MPG (s=22)", prob = TRUE, breaks = 50)
dev.off() # close the jpeg device so the file is written to disk

Compare the two histograms - why are they different? Explain in a comment.

#The histograms differ because of the sample size. With samples of 22, the sample means cluster tightly around the true population mean (about 20.09). With samples of 3, the means are spread much more widely and the histogram has longer left and right tails. A larger sample is more representative of the population, so its mean varies less from sample to sample.
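
One way to check the comparison numerically is to look at the spread of each set of sample means. This is a minimal sketch that regenerates both sets of means so it runs on its own; the names means3 and means22 are illustrative.

```r
# 1000 sample means for each sample size
means3 <- replicate(1000, mean(sample(mtcars$mpg, size = 3)))
means22 <- replicate(1000, mean(sample(mtcars$mpg, size = 22)))

# Population mean that both distributions are centered near
mean(mtcars$mpg)

# Standard deviation of the sample means: smaller for the larger sample size
sd(means3)
sd(means22)

# Range of the sample means: narrower for the larger sample size
range(means3)
range(means22)
```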