This project will demonstrate your ability to do exploratory data analysis on single variables of data in R and RStudio. The entire project will use the NSCC Student Dataset, which you will download and load into R in question 1.
Download the “nscc_student_data.csv” file from MyOpenMath and use the read.csv() function to store it into an object called “nscc_student_data”. Print the first few lines of each data set using the head() function. Print a summary of each data set using the str() function.
# Load dataset in and store as object "nscc_student_data"
nscc_student_data <- read.csv("C:/Users/Henhoag/Desktop/Math143H/Projects/nscc_student_data.csv")
# Preview of dataframe
head(nscc_student_data,3)
## Gender PulseRate CoinFlip1 CoinFlip2 Height ShoeLength Age Siblings
## 1 Female 64 5 5 62 11 19 4
## 2 Female 75 4 6 62 11 21 3
## 3 Female 74 6 1 60 10 25 2
## RandomNum HoursWorking Credits Birthday ProfsAge Coffee VoterReg
## 1 797 35 13 July 5 31 No Yes
## 2 749 25 12 December 27 30 Yes Yes
## 3 13 30 6 January 31 29 Yes No
# Summary of dataframe with str()
str(nscc_student_data)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
## $ VoterReg : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
a.) What are the dimensions of the nscc_student_data dataframe?
dim(nscc_student_data)
## [1] 40 15
The nscc_student_data dataframe contains 40 rows (observations) and 15 columns (variables).
b.) The chunk of code below will tell you how many values in the PulseRate variable exist (FALSE) and how many are NA (TRUE). How many values in the variable are missing?
# How many values in PulseRate variable are missing
table(is.na(nscc_student_data$PulseRate))
##
## FALSE TRUE
## 38 2
There are 2 missing values in the PulseRate variable.
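The same count can also be read off by summing the logical vector that is.na() returns. The sketch below uses a small made-up vector called `pulse` (an assumption for illustration, not the survey data), so it runs without the CSV file.

```r
# Hypothetical pulse values for illustration only (not the survey data)
pulse <- c(64, 75, NA, 65, NA, 72)

# is.na() returns TRUE/FALSE; TRUE counts as 1 when summed
n_missing <- sum(is.na(pulse))
n_present <- sum(!is.na(pulse))

n_missing
n_present
```

Applied to a whole dataframe, `colSums(is.na(nscc_student_data))` should give the number of missing values in every column at once.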
What is the mean of the pulse rate variable? What is the median of the pulse rate variable? Do they differ by much? If yes, explain why and which would be a better choice as the “center” or “average” of this variable.
# Calculate the mean and median of the pulse rate variable
mean(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 73.47368
median(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 70.5
The mean of the pulse rate is 73.47, while the median is 70.5. It is hard to judge whether a gap of about 3 is large without knowing the range of the variable. For example, if the values only ran from 69 to 75, a gap of 3 would be half the range and the two measures would be very different. Let’s look at the range.
range(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 50 98
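Now I have more confidence in saying that the median and the mean don’t differ by much. The values of PulseRate range over 48 points, so a difference of about 3 between mean and median is small (~6% of the range). The mean sits slightly above the median, which suggests a few higher values are pulling it up, so the median is probably the marginally safer choice of “center”, although either would describe this variable reasonably well.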
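The comparison above can be reproduced as a quick calculation. This sketch simply re-enters the mean, median, and range reported in the output; the object names are my own and are not part of the assignment.

```r
# Values reported in the output above for PulseRate
pulse_mean   <- 73.47368
pulse_median <- 70.5
pulse_range  <- c(50, 98)

# Gap between mean and median, and that gap as a share of the range
gap       <- pulse_mean - pulse_median
gap_share <- gap / diff(pulse_range)

round(gap, 2)              # 2.97
round(100 * gap_share, 1)  # 6.2 (percent of the 48-point range)
```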
What is the sample standard deviation of the pulse rate variable?
sd(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 12.51105
The standard deviation of the PulseRate values is 12.51.
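As a check on what sd() computes, the sketch below rebuilds the sample standard deviation by hand, dividing by n - 1 rather than n. The vector `x` is made up for illustration and is not taken from the survey.

```r
# Hypothetical values for illustration only (not the survey data)
x <- c(60, 64, 70, 72, 84)
n <- length(x)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Should agree with R's built-in sd()
all.equal(manual_sd, sd(x))
```

With missing values present, as in PulseRate, `na.rm = TRUE` drops the NA entries first, so n is the number of non-missing values.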
What is the Five Number Summary and IQR of the pulse rate variable? Based on the definition of outliers being more than 1.5 IQRs above Q3 or below Q1, are there any values in the pulse rate variable that are considered outliers?
# Five Number Summary of the PulseRate variable
summary(nscc_student_data$PulseRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 50.00 64.25 70.50 73.47 83.75 98.00 2
The IQR of the PulseRate variable is: 83.75 - 64.25 = 19.5.
The lower fence is 64.25 - 1.5 × 19.5 = 35 and the upper fence is 83.75 + 1.5 × 19.5 = 113. Any values below 35 or above 113 would be considered outliers. Based on that, there are no outliers: the minimum value is 50 and the maximum value is 98, both well within the fences.
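The outlier check can be written out as a short calculation. This sketch re-enters the quartiles and extremes reported in the output above; the object names are my own.

```r
# Quartiles reported by summary() above
q1  <- 64.25
q3  <- 83.75
iqr <- q3 - q1                 # 19.5

# Fences at 1.5 IQRs beyond the quartiles
lower_fence <- q1 - 1.5 * iqr  # 35
upper_fence <- q3 + 1.5 * iqr  # 113

# Observed extremes from range() above
obs_min <- 50
obs_max <- 98

# TRUE if either extreme falls outside the fences
any_outliers <- obs_min < lower_fence || obs_max > upper_fence
any_outliers                   # FALSE
```

On the actual column, `IQR(nscc_student_data$PulseRate, na.rm = TRUE)` should return the same 19.5. `boxplot.stats()` can also list flagged points in its `$out` element, though it is based on hinges, which may differ slightly from the quartiles summary() reports.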
The Gender variable gives whether students identified as male or female. Create a table and a barplot of that variable.
# Table of results of Gender variable
table(nscc_student_data$Gender)
##
## Female Male
## 27 13
# Barplot of results
barplot(table(nscc_student_data$Gender),
        main = "Gender of Students Surveyed",
        ylab = "Number of Students")
Split the dataframe into two subsets – one that has all the males and another that has all the females. Store them into objects called “NSCC_males” and “NSCC_females”. The first one has been done for you as a template.
# Create males subset
NSCC_males <- subset(nscc_student_data, nscc_student_data$Gender == "Male")
# Create females subset
NSCC_females <- subset(nscc_student_data, nscc_student_data$Gender == "Female")
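An alternative to calling subset() once per group is split(), which returns a list with one dataframe per level of the grouping variable. The sketch below uses a made-up five-row dataframe called `toy` (an assumption for illustration), not the survey data.

```r
# Hypothetical miniature of the survey data, for illustration only
toy <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male", "Female"),
  PulseRate = c(64, 72, NA, 60, 75)
)

# One dataframe per gender, stored in a named list
by_gender <- split(toy, toy$Gender)

names(by_gender)
nrow(by_gender$Male)
nrow(by_gender$Female)
```

The assignment asks for two separately named objects, so the subset() approach above is the one to submit; split() is just a compact way to get the same groups.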
What is the Five Number Summary for the pulse rate variable for each of the male and female subsets?
# Five Number Summary of males subset
summary(NSCC_males$PulseRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 60.00 71.00 70.85 80.00 96.00
# Five Number Summary of females subset
summary(NSCC_females$PulseRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 56.00 65.00 70.00 74.84 88.00 98.00 2
The Five Number Summaries of pulse rate for the two subsets are:
Males: min=50, Q1=60, median=71, Q3=80, max=96
Females: min=56, Q1=65, median=70, Q3=88, max=98
Create side-by-side boxplots for the pulse rate variable for each of the male and female subsets. Is there any noticeable difference between the spread of the variables?
# Create side-by-side boxplots for each subset
boxplot(NSCC_males$PulseRate, NSCC_females$PulseRate,
        names = c("Male", "Female"),
        main = "Pulse Rates of NSCC Students Surveyed",
        xlab = "Gender", ylab = "Pulse Rates")
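Even though the medians are similar (71 for males, 70 for females), the female pulse rates sit higher overall: the female minimum is 6 points above the male minimum and the female Q3 is 8 points above the male Q3. The spreads are comparable. The middle half of the female values is slightly wider (IQR of 23 versus 20), while the male values cover a slightly wider total range (46 versus 42).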
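To put numbers on the spread, the IQR and range of each group can be computed from the Five Number Summaries listed above. This sketch re-enters those values by hand; `spread()` is a small helper of my own, not a built-in function.

```r
# Five Number Summaries reported above (min, Q1, median, Q3, max)
males   <- c(min = 50, q1 = 60, median = 71, q3 = 80, max = 96)
females <- c(min = 56, q1 = 65, median = 70, q3 = 88, max = 98)

# Helper (hypothetical): IQR and total range from a five-number summary
spread <- function(s) {
  c(iqr   = unname(s["q3"] - s["q1"]),
    range = unname(s["max"] - s["min"]))
}

spread_table <- rbind(Male = spread(males), Female = spread(females))
spread_table
```

By these figures the male IQR is 20 and the female IQR is 23, while the total ranges are 46 and 42, so neither group is clearly more spread out than the other.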
Create a frequency distribution for how many males and females answered “Yes” or “No” to the variable “Coffee” by using the table() function. What percent of this sample of NSCC students drink coffee? Is there any noticeable difference in coffee drinking based on gender?
# Male Coffee Drinkers
table(NSCC_males$Coffee)
##
## No Yes
## 3 10
# Females Coffee Drinkers
table(NSCC_females$Coffee)
##
## No Yes
## 7 20
# Percent of Males that Drink Coffee
males_coffee<-subset(NSCC_males,Coffee == "Yes")
nrow(males_coffee)/nrow(NSCC_males)*100
## [1] 76.92308
# Percent of Females that Drink Coffee
females_coffee<-subset(NSCC_females,Coffee == "Yes")
nrow(females_coffee)/nrow(NSCC_females)*100
## [1] 74.07407
# Percent of all students surveyed who drink coffee
coffee_drinkers<-subset(nscc_student_data,Coffee == "Yes")
nrow(coffee_drinkers)/nrow(nscc_student_data)*100
## [1] 75
75% of all students surveyed \(\left(\frac{30}{40}\right)\) drink coffee.
Broken down by gender, 76.92% of males surveyed \(\left(\frac{10}{13}\right)\) drink coffee and 74.07% of females surveyed \(\left(\frac{20}{27}\right)\) drink coffee. Both gender-specific rates are within about 2 percentage points of the overall rate, so there is no noticeable difference in coffee drinking by gender in this sample.
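The same percentages can be obtained in one step with prop.table() on a two-way table. The sketch below rebuilds the table from the counts shown in the table() output above, so it runs without the CSV file; the object names are my own.

```r
# Counts taken from the table() output above
coffee_counts <- matrix(
  c(3, 10,
    7, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Male", "Female"), Coffee = c("No", "Yes"))
)

# Row percentages: share of "No" and "Yes" within each gender
row_pct <- round(100 * prop.table(coffee_counts, margin = 1), 2)
row_pct

# Overall percentage of coffee drinkers
overall_yes <- 100 * sum(coffee_counts[, "Yes"]) / sum(coffee_counts)
overall_yes
```

On the dataframe itself, `prop.table(table(nscc_student_data$Gender, nscc_student_data$Coffee), margin = 1)` should give the same proportions without creating the extra subsets.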