A Biology tutor at a high school in Clark County wants to assess the performance of his sophomore students. He administered a Biology quiz and collected the students' results after grading. The dataset contains the names, study hours, and scores of the students.
library(readr)
# URL of the grades dataset (student names, study hours, grades)
grades_url <- "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/grades.csv"
data <- read_csv(grades_url)
## Rows: 24 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Name
## dbl (2): StudyHours, Grade
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
## # A tibble: 24 × 3
## Name StudyHours Grade
## <chr> <dbl> <dbl>
## 1 Dan 10 50
## 2 Joann 11.5 50
## 3 Pedro 9 47
## 4 Rosie 16 97
## 5 Ethan 9.25 49
## 6 Vicky 1 3
## 7 Frederic 11.5 53
## 8 Jimmie 9 42
## 9 Rhonda 8.5 26
## 10 Giovanni 14.5 74
## # ℹ 14 more rows
These are the first 10 rows of the dataset.
tail(data)
## # A tibble: 6 × 3
## Name StudyHours Grade
## <chr> <dbl> <dbl>
## 1 Anila 10 48
## 2 Skye 12 52
## 3 Daniel 12.5 63
## 4 Aisha 12 64
## 5 Bill 8 NA
## 6 Ted NA NA
Looking at the last six rows of the dataset, it can be seen that two of the students (Bill and Ted) have missing values. Bill’s grade is missing, and both Ted’s study hours and grade are missing.
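Missing values can also be counted instead of being read off the printout. The following is a minimal sketch, assuming the six rows shown by `tail(data)` are rebuilt by hand in a data frame called `tail_rows` (a name introduced here only for illustration):

```r
# The six rows printed by tail(data), rebuilt by hand for illustration
tail_rows <- data.frame(
  Name = c("Anila", "Skye", "Daniel", "Aisha", "Bill", "Ted"),
  StudyHours = c(10, 12, 12.5, 12, 8, NA),
  Grade = c(48, 52, 63, 64, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(tail_rows))
print(na_counts)

# Rows that contain at least one missing value
incomplete <- tail_rows[!complete.cases(tail_rows), ]
print(incomplete$Name)
```

Running the same two calls on the full `data` object would cover all 24 rows rather than only the last six.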
Assuming these students actually took the quiz, and to ensure that their results are included in the analysis, we replace the missing values with the median of the respective variables. This preserves the central tendency of each variable without being overly influenced by extreme values.
data$StudyHours[is.na(data$StudyHours)] <- median(data$StudyHours, na.rm = TRUE)
data$Grade[is.na(data$Grade)] <- median(data$Grade, na.rm = TRUE)
data
## # A tibble: 24 × 3
## Name StudyHours Grade
## <chr> <dbl> <dbl>
## 1 Dan 10 50
## 2 Joann 11.5 50
## 3 Pedro 9 47
## 4 Rosie 16 97
## 5 Ethan 9.25 49
## 6 Vicky 1 3
## 7 Frederic 11.5 53
## 8 Jimmie 9 42
## 9 Rhonda 8.5 26
## 10 Giovanni 14.5 74
## # ℹ 14 more rows
tail(data)
## # A tibble: 6 × 3
## Name StudyHours Grade
## <chr> <dbl> <dbl>
## 1 Anila 10 48
## 2 Skye 12 52
## 3 Daniel 12.5 63
## 4 Aisha 12 64
## 5 Bill 8 49.5
## 6 Ted 10 49.5
As can be seen, Ted’s study hours have been replaced with the median study hours (10). Also, the grades for Bill and Ted have been replaced with the median grade (49.5).
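The same median replacement can be written once and applied to every numeric column. This is only a sketch: `impute_median()` is a hypothetical helper name, and the example uses just the six rows from `tail(data)`, so the medians it fills in differ from the full-data medians of 10 and 49.5.

```r
# Hypothetical helper: fill NAs in every numeric column with that column's median
impute_median <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df[numeric_cols] <- lapply(df[numeric_cols], function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

# The six rows printed by tail(data) before imputation, rebuilt by hand
tail_rows <- data.frame(
  Name = c("Anila", "Skye", "Daniel", "Aisha", "Bill", "Ted"),
  StudyHours = c(10, 12, 12.5, 12, 8, NA),
  Grade = c(48, 52, 63, 64, NA, NA),
  stringsAsFactors = FALSE
)

filled <- impute_median(tail_rows)
print(filled)
```

Non-numeric columns such as `Name` are left untouched, which is why the helper selects columns with `is.numeric` first.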
if (!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
library(ggplot2)
histogram1 <- ggplot(data, aes(StudyHours)) +
  geom_histogram() +
  labs(title = "Histogram of Study Hours",
       x = "Study Hours", y = "Frequency") +
  theme_minimal()
histogram1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram displays the students' study hours. The distribution looks roughly bell-shaped, with a peak at around 9.5 hours. The frequency at this peak is 4, indicating that 4 of the students studied for around 9.5 hours. There is a noticeable gap between 1 and 6 hours, indicating that none of the students studied for that number of hours. An outlier can be observed on the left of the histogram: a single student who studied for only 1 hour.
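The `stat_bin()` message suggests choosing a bin width explicitly, and the heights and gaps described above depend on that choice. A minimal sketch, assuming only the 16 rows printed above (after imputation) rather than the full 24-row dataset, and assuming 1-hour bins are a reasonable width:

```r
library(ggplot2)

# Study hours from the 16 rows printed above (after imputation)
study <- data.frame(
  StudyHours = c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
                 10, 12, 12.5, 12, 8, 10)
)

# binwidth = 1 gives one bar per hour; boundary = 0 aligns bar edges to whole hours
hist_plot <- ggplot(study, aes(StudyHours)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  labs(title = "Histogram of Study Hours (1-hour bins)",
       x = "Study Hours", y = "Frequency") +
  theme_minimal()
hist_plot
```

With an explicit `binwidth`, the message about `bins = 30` no longer appears, and each bar can be read directly as "students who studied between h and h + 1 hours".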
histogram2 <- ggplot(data, aes(Grade)) +
  geom_histogram() +
  labs(title = "Histogram of Grade",
       x = "Grade", y = "Frequency") +
  theme_minimal()
histogram2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram displays the students' grades. It also looks roughly normal, with most grades clustering around 50. The tallest bar has a frequency of 6, meaning that 6 of the students scored around 50 in the Biology quiz. There are also noticeable gaps in the histogram, indicating that no student received those grades. There are two outliers, one close to 0 and the other close to 100.
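The description above judges normality by eye. One possible numerical cross-check is the Shapiro-Wilk test, available in base R as `shapiro.test()`. The sketch below is not part of the original analysis and assumes only the 16 grades printed above (after imputation), so its result need not match the full dataset:

```r
# Grades from the 16 rows printed above (after imputation)
grade <- c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74,
           48, 52, 63, 64, 49.5, 49.5)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a normal distribution
normality <- shapiro.test(grade)
print(normality)
```

A small p-value would count as evidence against normality. With so few observations, the test has limited power, so it complements the histogram rather than replacing it.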
box1 <- ggplot(data, aes(StudyHours)) +
  geom_boxplot() +
  labs(title = "Box Plot of Study Hours",
       x = "Study Hours") +
  theme_minimal()
box1
This is a box plot of study hours. The box spans from around 9 to 12. The left edge of the box is at 9, indicating that the first quartile (25th percentile) of study hours is approximately 9. The right edge of the box is slightly above 12, indicating that the third quartile (75th percentile) is slightly above 12. The thick line inside the box marks the median, which is 10.
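The box edges and the middle line can also be computed rather than read off the plot. A minimal sketch using `quantile()`, assuming only the 16 rows printed above (after imputation), so the quartiles may differ slightly from those of the full 24-row dataset:

```r
# Study hours from the 16 rows printed above (after imputation)
hours <- c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
           10, 12, 12.5, 12, 8, 10)

# First quartile, median, and third quartile (R's default quantile definition)
q <- quantile(hours, probs = c(0.25, 0.5, 0.75))
print(q)

# Interquartile range: the width of the box
box_width <- IQR(hours)
print(box_width)
```

The 25% and 75% values correspond to the left and right edges of the box, and the 50% value corresponds to the thick line.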
box2 <- ggplot(data, aes(Grade)) +
  geom_boxplot() +
  labs(title = "Box Plot of Grade",
       x = "Grade") +
  theme_minimal()
box2
This is a box plot of Grade. The box spans from around 37.5 to 62.5. The left edge of the box is at approximately 37.5, indicating that the first quartile (25th percentile) of the grades is approximately 37.5. The right edge of the box is at approximately 62.5, indicating that the third quartile (75th percentile) is close to 62.5. The thick line inside the box marks the median. As shown, the median grade is approximately 50.
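Box plots conventionally draw a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box edges. The sketch below applies that rule by hand, assuming only the 16 grades printed above (after imputation). Because it uses a subset, the quartiles and flagged values are not expected to match the plot of the full dataset exactly.

```r
# Grades from the 16 rows printed above (after imputation)
grade <- c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74,
           48, 52, 63, 64, 49.5, 49.5)

q1 <- quantile(grade, 0.25, names = FALSE)
q3 <- quantile(grade, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Grades outside the fences would be drawn as individual points
flagged <- grade[grade < lower_fence | grade > upper_fence]
print(sort(flagged))
```

The two extreme grades near 0 and 100 fall outside the fences in this subset as well.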
scatter1 <- ggplot(data, aes(StudyHours, Grade)) +
  geom_point() +
  labs(title = "Scatter Plot of Study Hours and Grade",
       x = "Study Hours", y = "Grade") +
  theme_minimal()
scatter1
This scatter plot shows the relationship between study hours and grade for the students in the Biology quiz. There is a noticeable positive relationship between the two: as the value of the independent variable (Study Hours) increases, the value of the dependent variable (Grade) tends to increase. In other words, students who spend more hours studying tend to score a higher grade. There is one outlier at the bottom left of the scatter plot.
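The strength of the positive relationship can be summarised with a correlation coefficient. A minimal sketch, assuming only the 16 rows printed above (after imputation), so the values are indicative rather than the full-data results:

```r
# Study hours and grades from the 16 rows printed above (after imputation)
hours <- c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
           10, 12, 12.5, 12, 8, 10)
grade <- c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74,
           48, 52, 63, 64, 49.5, 49.5)

# Pearson correlation: strength of the linear relationship
r_pearson <- cor(hours, grade)

# Spearman rank correlation: less sensitive to the outlier at the bottom left
r_spearman <- cor(hours, grade, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

Values close to 1 indicate a strong positive association. Comparing the two coefficients gives a rough sense of how much the outlier influences the result.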
scatter2 <- ggplot(data, aes(StudyHours, Grade)) +
  geom_point() +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  labs(title = "Scatter Plot of Study Hours and Grade With Line of Best Fit",
       x = "Study Hours", y = "Grade") +
  theme_minimal()
scatter2
## `geom_smooth()` using formula = 'y ~ x'
The regression line, or line of best fit, shows the general trend of the data points. It slopes upward, suggesting a positive relationship between study hours and grade.
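The line drawn by `geom_smooth(method = "lm")` comes from a linear model of the form `y ~ x`, so its intercept and slope can be inspected with `lm()`. A minimal sketch, assuming only the 16 rows printed above (after imputation). The data frame name `quiz` is introduced here for illustration, and the coefficients will differ from those of the full dataset.

```r
# The 16 rows printed above (after imputation)
quiz <- data.frame(
  StudyHours = c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
                 10, 12, 12.5, 12, 8, 10),
  Grade = c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74,
            48, 52, 63, 64, 49.5, 49.5)
)

# Ordinary least squares fit of Grade on StudyHours
fit <- lm(Grade ~ StudyHours, data = quiz)
print(coef(fit))

# The slope estimates the change in grade per additional study hour
slope <- coef(fit)[["StudyHours"]]
print(slope)
```

A positive slope corresponds to the upward-sloping line in the plot.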
The visualizations provide more context on the relationship between study hours and grade. The scatter plot depicts a positive correlation, whereas the histograms and box plots provide a deeper understanding of the distribution and variability of study hours and grade.
It can be inferred that encouraging students to increase their study hours is likely to help them achieve higher grades.