Introduction

A Biology tutor at a high school in Clark County wants to assess the performance of his sophomore students. He administered a Biology quiz and, after grading, collected the students' results. The dataset contains the names, study hours, and grades of the students.

Loading the Dataset

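library(readr)
# URL of the grades dataset (CSV with Name, StudyHours and Grade columns)
grades_url <- "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/grades.csv"
# Read the CSV into a tibble
data <- read_csv(grades_url)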
## Rows: 24 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Name
## dbl (2): StudyHours, Grade
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
## # A tibble: 24 × 3
##    Name     StudyHours Grade
##    <chr>         <dbl> <dbl>
##  1 Dan           10       50
##  2 Joann         11.5     50
##  3 Pedro          9       47
##  4 Rosie         16       97
##  5 Ethan          9.25    49
##  6 Vicky          1        3
##  7 Frederic      11.5     53
##  8 Jimmie         9       42
##  9 Rhonda         8.5     26
## 10 Giovanni      14.5     74
## # ℹ 14 more rows

These are the first 10 rows of the dataset.

tail(data)
## # A tibble: 6 × 3
##   Name   StudyHours Grade
##   <chr>       <dbl> <dbl>
## 1 Anila        10      48
## 2 Skye         12      52
## 3 Daniel       12.5    63
## 4 Aisha        12      64
## 5 Bill          8      NA
## 6 Ted          NA      NA

Looking at the last six rows of the dataset, we can see that two of the students (Bill and Ted) have missing values. Bill's grade is missing, and both Ted's study hours and grade are missing.
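Missing values can also be counted directly rather than read off the printout. The sketch below is a minimal illustration: `tail_rows` is a hypothetical data frame that copies only the six rows printed above, so it is not the full dataset. On the full dataset the same calls would be applied to `data`.

```r
# Hypothetical copy of the six rows printed by tail(data)
tail_rows <- data.frame(
  Name       = c("Anila", "Skye", "Daniel", "Aisha", "Bill", "Ted"),
  StudyHours = c(10, 12, 12.5, 12, 8, NA),
  Grade      = c(48, 52, 63, 64, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(tail_rows))
na_counts

# Names of the students with at least one missing value
incomplete <- tail_rows$Name[!complete.cases(tail_rows)]
incomplete
```

Counting per column shows which variables need attention, while `complete.cases()` identifies the affected students.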

Data Exploration

We assume these students actually took the quiz. To keep their results in the analysis, we replace the missing values with the median of the respective variable. The median preserves the central tendency of each variable without being unduly influenced by extreme values.

data$StudyHours[is.na(data$StudyHours)] <- median(data$StudyHours, na.rm = TRUE)
data$Grade[is.na(data$Grade)] <- median(data$Grade, na.rm = TRUE)
data
## # A tibble: 24 × 3
##    Name     StudyHours Grade
##    <chr>         <dbl> <dbl>
##  1 Dan           10       50
##  2 Joann         11.5     50
##  3 Pedro          9       47
##  4 Rosie         16       97
##  5 Ethan          9.25    49
##  6 Vicky          1        3
##  7 Frederic      11.5     53
##  8 Jimmie         9       42
##  9 Rhonda         8.5     26
## 10 Giovanni      14.5     74
## # ℹ 14 more rows
tail(data)
## # A tibble: 6 × 3
##   Name   StudyHours Grade
##   <chr>       <dbl> <dbl>
## 1 Anila        10    48  
## 2 Skye         12    52  
## 3 Daniel       12.5  63  
## 4 Aisha        12    64  
## 5 Bill          8    49.5
## 6 Ted          10    49.5

As can be seen, Ted's study hours have been replaced with the median of study hours (10). The grades for Bill and Ted have likewise been replaced with the median grade (49.5).
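The two imputation lines repeat the same pattern, so it could be wrapped in a small helper. This is only a sketch: `impute_median` and `visible` are hypothetical names, and `visible` holds just the 16 rows printed in this report (before imputation), so its medians need not equal those of the full 24-row dataset.

```r
# Replace the missing entries of a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical copy of the 16 rows visible in the printouts above
visible <- data.frame(
  Name = c("Dan", "Joann", "Pedro", "Rosie", "Ethan", "Vicky", "Frederic",
           "Jimmie", "Rhonda", "Giovanni", "Anila", "Skye", "Daniel",
           "Aisha", "Bill", "Ted"),
  StudyHours = c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
                 10, 12, 12.5, 12, 8, NA),
  Grade = c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74,
            48, 52, 63, 64, NA, NA),
  stringsAsFactors = FALSE
)

# Apply the helper to every numeric column in one step
numeric_cols <- c("StudyHours", "Grade")
visible[numeric_cols] <- lapply(visible[numeric_cols], impute_median)

# Confirm that no missing values remain
anyNA(visible)
tail(visible, 2)
```

A helper like this keeps the imputation rule in one place, which makes it easier to switch to another strategy later if the median turns out to be a poor fit.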

Data Visualization

if(!require(ggplot2))install.packages("ggplot2")
## Loading required package: ggplot2
library(ggplot2)
histogram1 <- ggplot(data, aes(StudyHours)) + geom_histogram() +
  labs(title = "Histogram of Study Hours",
       x = "Study Hours", y = "Frequency") +
  theme_minimal()
histogram1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows the distribution of the students' study hours. It appears approximately normal, with a peak at around 9.5 hours. The tallest bar has a frequency of 4, meaning that four students studied for about 9.5 hours. There is a noticeable gap between 1 and 6 hours, indicating that no student reported study hours in that range. An outlier can be seen on the far left of the histogram: a single student with 1 study hour.
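Bar heights and gaps read from a plot can be checked by tabulating the bins directly. The sketch below uses base R's `hist()` with `plot = FALSE`. The 2-hour bin width and the vector `hours` (the 16 study-hour values visible in this report after imputation, not the full dataset) are assumptions chosen for illustration, so the counts will differ from those of the 30-bin default used in the plot.

```r
# Study hours from the 16 rows printed in this report (after imputation)
hours <- c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
           10, 12, 12.5, 12, 8, 10)

# Compute histogram bins without drawing; bins are (0,2], (2,4], ..., (14,16]
bins <- hist(hours, breaks = seq(0, 16, by = 2), plot = FALSE)

# One row per bin: lower edge, upper edge and number of students
bin_table <- data.frame(
  from  = head(bins$breaks, -1),
  to    = tail(bins$breaks, -1),
  count = bins$counts
)
bin_table

# Bins with no observations correspond to gaps in the histogram
empty_bins <- bin_table[bin_table$count == 0, ]
empty_bins
```

In ggplot2, a comparable plot could be drawn by passing `binwidth = 2` and `boundary = 0` to `geom_histogram()`, which also removes the `stat_bin()` message about the default of 30 bins.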

histogram2 <- ggplot(data, aes(Grade)) + geom_histogram() +
  labs(title = "Histogram of Grade",
       x = "Grade", y = "Frequency") +
  theme_minimal()
histogram2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram shows the distribution of the students' grades. It also appears approximately normal, with most grades clustered around 50. The tallest bar has a frequency of 6, meaning that six students scored about 50 on the Biology quiz. The noticeable gaps in the histogram indicate grade ranges in which no student scored. There are two outliers, one close to 0 and the other close to 100.

box1 <- ggplot(data, aes(StudyHours)) + geom_boxplot() +
  labs(title = "Box Plot of Study Hours",
       x = "Study Hours") +
  theme_minimal()

box1

This is a box plot of study hours. The box spans from around 9 to slightly above 12. The left edge of the box marks the first quartile (25th percentile), which is approximately 9. The right edge marks the third quartile (75th percentile), which is slightly above 12. The thick line inside the box marks the median of the dataset, which is 10.
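The values a box plot encodes can be computed directly. The sketch below uses `quantile()` for the quartiles and the common 1.5 × IQR rule to flag outliers. The vector `hours` is an assumption: it holds only the 16 study-hour values visible in this report (after imputation), so the results can differ slightly from those of the full dataset.

```r
# Study hours from the 16 rows printed in this report (after imputation)
hours <- c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
           10, 12, 12.5, 12, 8, 10)

# First quartile, median and third quartile (the box edges and middle line)
q <- quantile(hours, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range and the 1.5 * IQR fences used to flag outliers
iqr <- q[["75%"]] - q[["25%"]]
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Observations outside the fences are drawn as individual points
outliers <- hours[hours < lower_fence | hours > upper_fence]
outliers
```

For these 16 values the quartiles come out at 9, 10 and 12, and the only value outside the fences is the 1-hour observation.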

box2 <- ggplot(data, aes(Grade)) + geom_boxplot() +
  labs(title = "Box Plot of Grade",
       x = "Grade") +
  theme_minimal()

box2

This is a box plot of grade. The box spans from around 37.5 to 62.5. The left edge of the box marks the first quartile (25th percentile), which is approximately 37.5. The right edge marks the third quartile (75th percentile), which is close to 62.5. The thick line inside the box marks the median of the dataset. As shown, the median is approximately 50.

scatter1 <- ggplot(data, aes(StudyHours, Grade)) +
  geom_point() +
  labs(title = "Scatter Plot of Study Hours and Grade",
       x = "Study Hours", y = "Grade") +
  theme_minimal()

scatter1

This scatter plot shows the relationship between the students' study hours and their grades in the Biology quiz. There is a noticeable positive relationship: as the independent variable (Study Hours) increases, the dependent variable (Grade) tends to increase as well. In other words, students who spent more hours studying tended to score higher grades. One observation at the bottom left of the scatter plot is an outlier.
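The strength of a relationship like this can be summarised with the Pearson correlation coefficient. The sketch below is illustrative only: `study_hours` and `grade` are hypothetical vectors holding the 14 complete rows visible in this report, not the full dataset, so the coefficient for the full data may differ.

```r
# The 14 complete (non-missing) rows visible in this report
study_hours <- c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
                 10, 12, 12.5, 12)
grade       <- c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74,
                 48, 52, 63, 64)

# Pearson correlation: values near +1 indicate a strong positive
# linear relationship, values near 0 indicate little linear relationship
r <- cor(study_hours, grade)
r
```

On the full dataset, `cor(data$StudyHours, data$Grade)` would give the corresponding figure once the missing values have been handled.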

scatter2 <- ggplot(data, aes(StudyHours, Grade)) +
  geom_point() +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  labs(title = "Scatter Plot of Study Hours and Grade With Line of Best Fit",
       x = "Study Hours", y = "Grade") +
  theme_minimal()

scatter2
## `geom_smooth()` using formula = 'y ~ x'

The regression line, or line of best fit, shows the general trend of the data points. It slopes upward, indicating a positive relationship between study hours and grade.
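With `method = "lm"` and the formula `y ~ x`, the line drawn by `geom_smooth()` is an ordinary least-squares fit, so its slope and intercept can be obtained with `lm()`. The sketch below assumes a hypothetical data frame `visible_complete` containing only the 14 complete rows visible in this report, so its coefficients will not match the line fitted to the full dataset.

```r
# The 14 complete (non-missing) rows visible in this report
visible_complete <- data.frame(
  StudyHours = c(10, 11.5, 9, 16, 9.25, 1, 11.5, 9, 8.5, 14.5,
                 10, 12, 12.5, 12),
  Grade      = c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74,
                 48, 52, 63, 64)
)

# Ordinary least-squares fit of Grade on StudyHours
fit <- lm(Grade ~ StudyHours, data = visible_complete)

# Intercept and slope; the slope is the expected change in grade
# for each additional hour of study
coefs <- coef(fit)
coefs

# Predicted grade for a hypothetical student who studies 10 hours
predicted <- predict(fit, newdata = data.frame(StudyHours = 10))
predicted
```

A positive `StudyHours` coefficient corresponds to the upward slope of the line in the plot.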

Conclusion

The visualizations provide more context on the relationship between study hours and grade. The scatter plots show a positive correlation, whereas the histograms and box plots describe the distribution and variability of study hours and grade.

It can be inferred that encouraging students to increase their study hours may help them achieve higher grades, although the correlation alone does not prove causation.