# Enter your name here: Joseph Silvestri
# 1. I did this homework by myself, with help from the book and the professor.
(Chapters 8, 9, and 10 of Introduction to Data Science)
Reminders of things to practice from previous weeks:
Descriptive statistics: mean( ) max( ) min( )
Sequence operator: : (For example, 1:4 is shorthand for 1, 2, 3, 4)
Create a function: myFunc <- function(myArg) { }
?command: Ask R for help with a command
This module: Sampling is a process of drawing elements from a larger set. In data science, when analysts work with data, they often work with a sample of the data, rather than all of the data (which we call the population), because of the expense of obtaining all of the data.
One must be careful, however, because statistics from a sample rarely match the characteristics of the population. The goal of this homework is to sample from a data set several times and explore the meaning of the results. Before you get started make sure to read Chapters 8-10 of An Introduction to Data Science. Don't forget your comments!
# Print the minimum, mean, and maximum of a numeric vector
vectorStats <- function(vector) {
  Min1 <- min(vector)
  Mean1 <- mean(vector)
  Max1 <- max(vector)
  cat("Minimum is :", Min1, "\n")
  cat("Mean is: ", Mean1, "\n")
  cat("Maximum is :", Max1, "\n")
}
vector<-(1:10)
vectorStats(vector)
## Minimum is : 1
## Mean is: 5.5
## Maximum is : 10
# Extended version: also print the median and standard deviation
vectorStats <- function(vector) {
  Min1 <- min(vector)
  Mean1 <- mean(vector)
  Med1 <- median(vector)
  SD1 <- sd(vector)
  Max1 <- max(vector)
  cat("Minimum is :", Min1, "\n")
  cat("Mean is: ", Mean1, "\n")
  cat("Maximum is :", Max1, "\n")
  cat("Median is :", Med1, "\n")
  cat("Standard deviation is :", SD1, "\n")
}
vector<-(1:10)
vectorStats(vector)
## Minimum is : 1
## Mean is: 5.5
## Maximum is : 10
## Median is : 5.5
## Standard deviation is : 3.02765
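Because `cat()` only prints, the statistics above cannot be stored or compared later. A minimal sketch of one alternative follows; `vectorStatsValues` is a hypothetical name that is not part of the assignment, and it returns the same five statistics as a named vector instead of printing them.

```r
# Hypothetical variant of vectorStats() that returns the statistics
# instead of printing them, so they can be stored and reused.
vectorStatsValues <- function(x) {
  c(min    = min(x),
    mean   = mean(x),
    max    = max(x),
    median = median(x),
    sd     = sd(x))
}

stats <- vectorStatsValues(1:10)
stats["mean"]  # pick out a single statistic by name
```

Returning a value keeps the calculation separate from the display, which may be convenient when the same statistics are needed for several samples.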
myCars <- mtcars
Use head(myCars) and tail(myCars) to show the data. Add a comment that describes what each variable in the data set contains.
Hint: Use the ? or help( ) command with mtcars to get help on this dataset.
head(myCars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(myCars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
help(mtcars)
## starting httpd help server ... done
#mpg- Miles per (US) gallon
#cyl- Number of cylinders
#disp- Displacement (cubic inches)
#hp- Gross horsepower
#drat- Rear axle ratio
#wt- Weight (1000 lbs)
#qsec- Quarter-mile time (seconds)
#vs- Engine shape (0 = V-shaped, 1 = straight)
#am- Transmission (0 = automatic, 1 = manual)
#gear- Number of forward gears
#carb- Number of carburetors
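Alongside the help page, the structure of the data frame can be checked directly. The short sketch below uses only base R functions and works on `mtcars` itself, so it does not depend on `myCars` having been created.

```r
# str() lists each column with its type and first few values
str(mtcars)

# Dimensions: number of cars (rows) and number of variables (columns)
dim(mtcars)

# Check that every column is stored as a number, including the 0/1 codes
all(sapply(mtcars, is.numeric))
```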
sample(myCars$mpg,3,replace = FALSE)
## [1] 33.9 22.8 19.2
vector<-(myCars$mpg)
vectorStats(vector)
## Minimum is : 10.4
## Mean is: 20.09062
## Maximum is : 33.9
## Median is : 19.2
## Standard deviation is : 6.026948
mean(sample(myCars$mpg,3, replace = FALSE))
## [1] 18
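#The sample mean (18) differs from the population mean (20.09) because the sample contains only 3 of the 32 cars, so it may not be representative of the population. Each call to sample() also draws different cars, so the sample mean changes from run to run.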
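Every call to `sample()` draws new cars, so the outputs above will not repeat from run to run. A minimal sketch of making a draw reproducible with `set.seed()` follows; the seed value 123 and the names `firstDraw` and `secondDraw` are arbitrary choices for illustration.

```r
# Fix the random number generator so the same cars are drawn every run.
# The seed value 123 is arbitrary.
set.seed(123)
firstDraw <- sample(mtcars$mpg, 3, replace = FALSE)

# Resetting the same seed reproduces the same draw
set.seed(123)
secondDraw <- sample(mtcars$mpg, 3, replace = FALSE)

identical(firstDraw, secondDraw)

# Gap between this sample's mean and the population mean
mean(firstDraw) - mean(mtcars$mpg)
```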
replicate(10,mean(sample(mtcars$mpg, size=3)),simplify = TRUE)
## [1] 16.36667 17.30000 24.86667 18.06667 22.36667 23.56667 21.10000 20.73333
## [9] 18.16667 20.60000
#replicate() repeats the sampling 10 times and records the mean of each sample. Each sample contains different cars from the population, so the mean varies from one sample to the next.
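To show what `replicate()` is doing, the sketch below writes the same repetition as an explicit `for` loop. The seed value 42 and the names `viaReplicate` and `viaLoop` are assumptions made for this illustration.

```r
# replicate() re-evaluates the expression once per repetition
set.seed(42)  # arbitrary seed
viaReplicate <- replicate(10, mean(sample(mtcars$mpg, size = 3)))

# The same work written as a loop, starting from the same seed
set.seed(42)
viaLoop <- numeric(10)
for (i in 1:10) {
  viaLoop[i] <- mean(sample(mtcars$mpg, size = 3))
}

# Check whether both approaches produced the same ten sample means
all.equal(viaReplicate, viaLoop)
```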
values <- replicate(1000, mean(sample(mtcars$mpg, size = 3)), simplify = TRUE)
jpeg(filename = "MyCarsMPG.jpeg")
hist(values, main = "My Cars MPG", prob = TRUE, breaks = 50)
dev.off() # close the JPEG device so the file is written
values22 <- replicate(1000, mean(sample(mtcars$mpg, size = 22)), simplify = TRUE)
jpeg(filename = "MyCarsMPGSize22.jpeg")
hist(values22, main = "My Cars MPG (s=22)", prob = TRUE, breaks = 50)
dev.off() # close the JPEG device so the file is written
M. Compare the two histograms - why are they different? Explain in a comment.
#The histograms differ because of the sample sizes. With the larger sample (22 cars), the sample means stay close to the true population mean (about 20 to 21 in my histogram versus a true mean of about 20). With the smaller sample (3 cars), the histogram was centered closer to 18 and had longer left and right tails, because a smaller sample is less representative of the population.
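The difference between the two histograms can also be checked numerically. The sketch below repeats both simulations under an arbitrary seed and compares the center and spread of the sample means; `means3` and `means22` are names chosen for this illustration.

```r
set.seed(1)  # arbitrary seed so the comparison is reproducible
means3  <- replicate(1000, mean(sample(mtcars$mpg, size = 3)))
means22 <- replicate(1000, mean(sample(mtcars$mpg, size = 22)))

# Centers of the two sampling distributions next to the population mean
c(population = mean(mtcars$mpg), n3 = mean(means3), n22 = mean(means22))

# Spread of the sample means for each sample size
c(n3 = sd(means3), n22 = sd(means22))
```

A smaller standard deviation for the size-22 means would correspond to the narrower histogram with shorter tails.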