This project will demonstrate your ability to do exploratory data analysis on single variables of data in R and RStudio. The entire project will use the NSCC Student Dataset, which you will need to load into R in question 1.
Download the “nscc_student_data.csv” file from Blackboard and use the read.csv() function to store it into an object called “nscc_student_data”. Print the first few lines of the dataset using the head() function. Also print the structure of the dataset using the str() function.
# Store the dataset downloaded from Blackboard in R, print its first 6 rows, and check its structure to see the variables and their types.
# Load dataset in and store as object "nscc_student_data"
nscc_student_data <- read.csv("nscc_student_data.csv")
# Preview first 6 lines of dataset
head(nscc_student_data)
##   Gender PulseRate CoinFlip1 CoinFlip2 Height ShoeLength Age Siblings RandomNum
## 1 Female        64         5         5     62      11.00  19        4       797
## 2 Female        75         4         6     62      11.00  21        3       749
## 3 Female        74         6         1     60      10.00  25        2        13
## 4 Female        65         4         4     62      10.75  19        1       613
## 5 Female        NA        NA        NA     66         NA  26        6        53
## 6 Female        72         6         5     67       9.75  21        1       836
##   HoursWorking Credits    Birthday ProfsAge Coffee VoterReg
## 1           35      13      July 5       31     No      Yes
## 2           25      12 December 27       30    Yes      Yes
## 3           30       6  January 31       29    Yes       No
## 4           18       9        6-13       31    Yes      Yes
## 5           24      15       02-15       32     No      Yes
## 6           15       9    april 14       32     No      Yes
# Structure of dataset
str(nscc_student_data)
## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : chr  "July 5" "December 27" "January 31" "6-13" ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ VoterReg    : chr  "Yes" "Yes" "No" "Yes" ...
a.) What are the dimensions of the nscc_student_data dataframe?
# Find the dimensions of the nscc_student_data dataframe
dim(nscc_student_data)
## [1] 40 15
The dimensions of the nscc_student_data dataframe are 40 by 15: 40 observations (rows) of 15 variables (columns).
b.) The chunk of code below will tell you how many values in the PulseRate variable exist (FALSE) and how many are NA (TRUE). How many values in the variable are missing?
# How many values in PulseRate variable are missing
table(is.na(nscc_student_data$PulseRate))
##
## FALSE TRUE
## 38 2
Two of the 40 values in the PulseRate variable are missing.
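The same count can be obtained as a single number instead of a table. The sketch below is only an illustration: it uses the first ten PulseRate values printed by str() above rather than the full column, and the object names pulse, n_missing, and n_present are made up for this example.

```r
# First ten PulseRate values as printed by str() above (not the full column)
pulse <- c(64, 75, 74, 65, NA, 72, 72, 60, 66, 60)

# is.na() returns TRUE/FALSE, and TRUE counts as 1 when summed
n_missing <- sum(is.na(pulse))
n_present <- sum(!is.na(pulse))

n_missing  # number of NA values
n_present  # number of recorded values
```

On the full dataframe, sum(is.na(nscc_student_data$PulseRate)) should agree with the TRUE count in the table above, and colSums(is.na(nscc_student_data)) would give the missing count for every variable at once.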
Use an r chunk to calculate the mean, median, and sample standard deviation of the pulse rate variable. Do the mean and median differ by much? If yes, explain why and which would be a better choice as the “center” or “average” of this variable.
# Calculate the mean, median, and sample standard deviation with the mean(), median(), and sd() functions. Setting na.rm = TRUE drops the missing values; otherwise each result would be NA.
mean(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 73.47368
median(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 70.5
sd(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 12.51105
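The mean (73.47) is about 3 higher than the median (70.5). That gap suggests the distribution is skewed to the right: a few high pulse rates pull the mean upward. The median is the better choice for the "center" of this variable because it is resistant to extreme values, while the mean is not.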
Use an r chunk to calculate the Five Number Summary and IQR of the pulse rate variable. Clearly state your answer for each below the r chunk. Based on the definition of outliers being more than 1.5 IQRs below Q1 or above Q3, are there any values in the pulse rate variable that are considered outliers? Clearly state below the thresholds for a data value to be considered an outlier.
# Get the five number summary with the fivenum() function and the IQR with the IQR() function
fivenum(nscc_student_data$PulseRate)
## [1] 50.0 64.0 70.5 85.0 98.0
IQR(nscc_student_data$PulseRate, na.rm = TRUE)
## [1] 19.5
# To find the outlier thresholds I use the five number summary and the IQR. Outliers are more than 1.5 IQRs below Q1 or above Q3, so I first calculate 1.5 * 19.5, the amount to add to Q3 and subtract from Q1.
1.5 * 19.5
## [1] 29.25
# Then add 29.25 to 85.0 (Q3) and subtract 29.25 from 64 (Q1) to get the upper and lower outlier thresholds.
29.25 + 85.0
## [1] 114.25
64 - 29.25
## [1] 34.75
The five number summary of the PulseRate variable is: Min = 50.0, Q1 = 64.0, Median = 70.5, Q3 = 85.0, Max = 98.0
The IQR of the PulseRate variable is: 19.5
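Any data values below 34.75 or above 114.25 would be considered outliers. The minimum (50) and the maximum (98) both fall inside those thresholds, so there are no outliers for this variable. Note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles that IQR() uses (here the hinges give 85 - 64 = 21 rather than 19.5); the conclusion is the same with either spread.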
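The threshold arithmetic can also be computed without typing the numbers in by hand. The sketch below is an illustration only: it uses the first ten PulseRate values printed by str() earlier, not the full column, so its thresholds differ from the ones reported for the whole variable. The names pulse, q, iqr, lower_fence, upper_fence, and outliers are made up for this example.

```r
# First ten PulseRate values as printed by str() earlier (not the full column)
pulse <- c(64, 75, 74, 65, NA, 72, 72, 60, 66, 60)

# Q1 and Q3 from quantile(), which uses the same quartile definition as IQR()
q   <- quantile(pulse, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
iqr <- IQR(pulse, na.rm = TRUE)

# Thresholds: 1.5 IQRs below Q1 and 1.5 IQRs above Q3
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Keep only the non-missing values that fall outside the thresholds
outliers <- pulse[!is.na(pulse) & (pulse < lower_fence | pulse > upper_fence)]

c(lower_fence, upper_fence)
outliers
```

Taking Q1 and Q3 from quantile() instead of fivenum() keeps the quartiles and the IQR on the same definition, so Q3 - Q1 always equals the value IQR() returns.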
The Gender variable gives whether students identified as male or female. Create a table and a barplot of that variable.
# Create a table of counts for the Gender variable with the table() function
table(nscc_student_data$Gender)
##
## Female Male
## 27 13
# barplot() needs the counts for each category, so pass it the table of the Gender variable.
barplot(table(nscc_student_data$Gender))
Split the dataframe into two subsets – one that has all the males and another that has all the females. Store them into objects called “NSCC_males” and “NSCC_females”. The first one has been done for you as a template.
# Create males subset
NSCC_males <- subset(nscc_student_data, nscc_student_data$Gender == "Male")
# Create females subset
NSCC_females <- subset(nscc_student_data, nscc_student_data$Gender == "Female")
Use an r chunk below to generate the information which will give you the Five Number Summary for the pulse rate variable for each of the male and female subsets.
# Use the fivenum() function on the PulseRate variable of each subset.
# Five Number Summary of males subset
fivenum(NSCC_males$PulseRate)
## [1] 50 60 71 80 96
# Five Number Summary of females subset
fivenum(NSCC_females$PulseRate)
## [1] 56 65 70 88 98
The Five Number Summaries (min, Q1, median, Q3, max) of the two subsets are:
Males: 50, 60, 71, 80, 96
Females: 56, 65, 70, 88, 98
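Both summaries can also be produced in one call, without creating separate subsets first. The sketch below is a minimal illustration on a toy dataframe, not the real data: its ten values are just the two five number summaries above, reused as stand-in observations, and the names toy and by_gender are made up for this example.

```r
# Toy dataframe (hypothetical): the values are the two five number
# summaries reported above, reused as stand-in observations.
toy <- data.frame(
  Gender    = rep(c("Male", "Female"), each = 5),
  PulseRate = c(50, 60, 71, 80, 96, 56, 65, 70, 88, 98)
)

# tapply() splits PulseRate by Gender and applies fivenum() to each group
by_gender <- tapply(toy$PulseRate, toy$Gender, fivenum)

by_gender[["Male"]]
by_gender[["Female"]]
```

With the real data, the same pattern would be tapply(nscc_student_data$PulseRate, nscc_student_data$Gender, fivenum).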
Create side-by-side boxplots of the pulse rate variable for each of the male and female subsets. Is there any noticeable difference between the two subsets?
# Create side-by-side boxplots by passing both subsets to boxplot(), separated by a comma. The names argument labels each box.
boxplot(NSCC_males$PulseRate, NSCC_females$PulseRate, names = c("Males", "Females"))
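There is a noticeable difference between the two subsets. The female median (70) sits much closer to Q1 (65) than to Q3 (88), so the upper half of the female box is stretched out, while the male median (71) sits roughly midway between Q1 (60) and Q3 (80). The female box also reaches higher overall, with a Q3 of 88 compared to 80 for the males.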
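Side-by-side boxplots can also be drawn straight from one dataframe with the formula interface of boxplot(), which labels each box with its group name. The sketch below is an illustration on a toy dataframe, not the real data: its values are the two five number summaries reported earlier, reused as stand-in observations, and the names toy and bp are made up for this example.

```r
# Toy dataframe (hypothetical): stand-in values taken from the two
# five number summaries reported earlier, not the real observations.
toy <- data.frame(
  Gender    = rep(c("Male", "Female"), each = 5),
  PulseRate = c(50, 60, 71, 80, 96, 56, 65, 70, 88, 98)
)

# PulseRate ~ Gender means "PulseRate split by Gender";
# one box is drawn per group, labelled with the group name
bp <- boxplot(PulseRate ~ Gender, data = toy, ylab = "Pulse rate")

# The group labels used on the x-axis, in plotting order
bp$names
```

With the real data, the equivalent call would be boxplot(PulseRate ~ Gender, data = nscc_student_data), which avoids having to create the two subsets just for the plot.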
Create a frequency distribution for how many males and females answered “Yes” or “No” to the variable “Coffee” by using the table() function. What percent of this sample of NSCC students drink coffee? Is there any noticeable difference in coffee drinking based on gender?
# Male Coffee Drinkers
table(NSCC_males$Coffee)
##
## No Yes
## 3 10
# Female Coffee Drinkers
table(NSCC_females$Coffee)
##
## No Yes
## 7 20
# Percent of Males that Drink Coffee
10/13 * 100
## [1] 76.92308
# Percent of Females that Drink Coffee
20/27 * 100
## [1] 74.07407
# To find the percent of the whole sample that drinks coffee, first get the total number of students (males plus females)
13+27
## [1] 40
# 10 males and 20 females drink coffee, so 30 of the 40 students do: divide 30 by 40 and multiply by 100
30/40 * 100
## [1] 75
75% of this sample of NSCC students drink coffee. There is only a slight difference in coffee drinking based on gender: 76.9% of males versus 74.1% of females, a gap of less than 3 percentage points. With only 13 males and 27 females, a gap that small is not meaningful, and a sample this small cannot be generalized to a larger population. A more reliable answer on coffee drinking by gender would need a much larger pool of participants.
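The percentages can also be computed from a single two-way table instead of by hand. The sketch below rebuilds that table from the counts reported in the output above (7 and 20 for females, 3 and 10 for males); the names coffee_counts, row_pct, and overall_pct are made up for this example.

```r
# Counts copied from the table() output above
coffee_counts <- matrix(
  c(7, 20,
    3, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"), Coffee = c("No", "Yes"))
)

# margin = 1 makes each row (gender) sum to 1; multiply by 100 for percents
row_pct <- prop.table(coffee_counts, margin = 1) * 100

# Overall percent of coffee drinkers in the sample
overall_pct <- sum(coffee_counts[, "Yes"]) / sum(coffee_counts) * 100

round(row_pct, 1)
overall_pct
```

With the real data, table(nscc_student_data$Gender, nscc_student_data$Coffee) should produce the same counts directly, and prop.table() can be applied to that table in the same way.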