Library

The packages that I have installed include the following:

  • dplyr
  • ggplot2
  • tidyverse
  • plotly

The packages are loaded below.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'plotly'
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Linear Regression

Linear regression is a statistical model that analyzes the relationship between a dependent variable (y) and one or more independent variables (x). The goal of linear regression is to predict the value of the dependent variable from the independent variables. The equation for simple linear regression, with a single independent variable, is below; \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.

\[ y = \beta_0 + \beta_1x_1 + \epsilon \]
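As a minimal sketch of how this equation is fitted in R, the block below uses only the first six rows shown in this report's head() output, with reading score as y and math score as x. Six rows are far too few for a meaningful model, and the choice of reading score as the dependent variable and the math score of 80 used for prediction are assumptions made for illustration.

```r
# First six rows of the dataset, copied from the head() output in this report.
# Six rows only illustrate the function calls; they are not a real analysis.
scores <- data.frame(
  math.score    = c(72, 69, 90, 47, 76, 71),
  reading.score = c(72, 90, 95, 57, 78, 83)
)

# Fit reading.score = beta_0 + beta_1 * math.score + epsilon by least squares.
fit <- lm(reading.score ~ math.score, data = scores)

# coef() returns the estimated intercept (beta_0) and slope (beta_1).
coef(fit)

# Predict the reading score for a hypothetical math score of 80.
predicted <- predict(fit, newdata = data.frame(math.score = 80))
predicted
```

On the full data, the same call would take the complete data frame in place of the six-row sample.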

Summary of Data

The dataset that I have chosen is about students’ academic performance. The columns included in this dataset are gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, and writing score. There are 1,000 entries.

##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

Plotly Plot: Math vs. Reading Scores

There is a positive correlation between math scores and reading scores. This means that students who do well in math also tend to do well in reading. Most of the scores fall between 40% and 90% for both math and reading.

R code: Math vs. Reading Scores

plotly1 <- plot_ly(data = StudentsPerformance_1_, x = ~math.score, y = ~reading.score,
                   type = "scatter", mode = "markers", name = "Data",
                   marker = list(color = "purple", size = 5)) %>%
  layout(title = "Math vs. Reading Scores",
         xaxis = list(title = "Math Score"),
         yaxis = list(title = "Reading Score"))

plotly1
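The positive correlation described for this plot can also be expressed as a number. The sketch below computes the Pearson correlation coefficient with base R's cor(), using only the six rows visible in the head() output, so the value it returns is illustrative and will differ from the value for all 1,000 students.

```r
# Math and reading scores from the first six rows of the dataset.
math    <- c(72, 69, 90, 47, 76, 71)
reading <- c(72, 90, 95, 57, 78, 83)

# cor() computes the Pearson correlation by default.
# Values near +1 indicate a strong positive linear relationship.
r <- cor(math, reading)
r
```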

Plotly Plot: Test Prep Course Completion

Among all students, almost two-thirds did not complete a test preparation course. Specifically, 358 students did complete this course while 642 students did not.

R code: Test Prep Course Completion

plotly2 <- plot_ly(data = StudentsPerformance_1_, x = ~test.preparation.course,
                   type = "histogram", marker = list(color = "pink")) %>%
  layout(title = "Test Preparation Course Completion",
         xaxis = list(title = "Completed Course"),
         yaxis = list(title = "Count", range = c(0, 800)))

plotly2
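The "almost two-thirds" figure can be checked directly from the two counts reported above. This is a small sketch that types in those counts by hand; the vector names are chosen here for illustration.

```r
# Counts reported in the text: 358 completed the course, 642 did not.
prep_counts <- c(completed = 358, none = 642)

# Share of each group as a percentage of all students.
prep_share <- round(100 * prep_counts / sum(prep_counts), 1)
prep_share
# completed 35.8, none 64.2
```

On the full data frame, the counts themselves could be obtained with table() applied to the test.preparation.course column.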

Plotly Plot: Test Prep by Avg Scores

This plot shows a difference of about 7.6 percentage points in average scores between students who completed the test preparation course and those who did not. Averaged across math, reading, and writing, the mean score is 72.67% for course completers and 65.04% for non-completers.

R code: Test Prep by Avg Scores

avgscores <- StudentsPerformance_1_ %>% group_by(test.preparation.course) %>% summarize(avgscores = mean((math.score + writing.score + reading.score) / 3))

plotly4 <- plot_ly(data = avgscores, x = ~test.preparation.course, y = ~avgscores,
                   type = "bar", color = ~test.preparation.course,
                   colors = c("orange", "darkblue")) %>%
  layout(title = "Test Prep Course vs Average Scores",
         xaxis = list(title = "Test Preparation Course"),
         yaxis = list(title = "Average Score - Math, Reading, and Writing", range = c(0, 100)))

plotly4
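The gap between the two group averages can be stated in two ways, and the sketch below computes both from the averages reported above: the absolute gap in percentage points, and the gap relative to the non-completers' average. The variable names are chosen here for illustration.

```r
# Group averages reported in the text (percent).
avg_completed <- 72.67
avg_none      <- 65.04

# Absolute gap in percentage points.
gap_points <- round(avg_completed - avg_none, 2)
gap_points
# 7.63

# Gap relative to the non-completers' average, in percent.
gap_relative <- round(100 * (avg_completed - avg_none) / avg_none, 1)
gap_relative
# 11.7
```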

Ggplot: Writing Scores by Gender

In this dataset, females generally outperform males in writing scores. The majority of female students score in the 60-100% range, while most male students score in the 50-90% range. Overall, most students’ writing scores range from 45% to 100%.

R code: Writing Scores by Gender

ggplot1 <- ggplot(StudentsPerformance_1_, aes(x = writing.score, y = writing.score, color = gender)) +
  geom_jitter(size = 1, width = 12, height = 12) +
  labs(title = "Writing Scores By Gender", x = "Writing Score", y = "Writing Score") +
  scale_color_manual(values = c("red", "blue"))

ggplot1

Ggplot: Lunch by Parental Education

The chart reveals that the majority of students have standard lunches, as opposed to free or reduced lunches. The most significant discrepancies are for associate’s degrees, high school diplomas, some college experience, and some high school experience. In contrast, the differences for master’s degrees and bachelor’s degrees are much smaller.

R code: Lunch by Parental Level of Education

ggplot2 <- ggplot(StudentsPerformance_1_, aes(x = lunch, fill = parental.level.of.education)) +
  geom_bar(position = "dodge") +
  labs(title = "Lunch by Parental Level of Education", x = "Lunch Type", y = "Number of Students")

ggplot2
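The bar heights in this chart are counts of students for each combination of lunch type and parental education. As a rough sketch, the same kind of count can be tabulated with base R's table(); the block below uses only the six rows from the head() output, so the numbers are illustrative and are not the counts plotted for the full dataset.

```r
# Lunch type and parental education for the first six rows of the dataset.
sample_rows <- data.frame(
  lunch = c("standard", "standard", "standard",
            "free/reduced", "standard", "standard"),
  parental.level.of.education = c("bachelor's degree", "some college",
                                  "master's degree", "associate's degree",
                                  "some college", "associate's degree")
)

# Cross-tabulate: one row per lunch type, one column per education level.
lunch_by_education <- with(sample_rows, table(lunch, parental.level.of.education))
lunch_by_education
```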

First Math Text written in LaTeX: Mean Formula

\[ \bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_i \]
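As a small worked example of this formula, the sketch below applies it to the six math scores shown in the head() output and compares the result with R's built-in mean(). The six values are a sample for illustration, not the full column.

```r
# Math scores from the first six rows of the dataset.
x <- c(72, 69, 90, 47, 76, 71)

# Apply the formula directly: sum the values, then divide by n.
n <- length(x)
manual_mean <- sum(x) / n
manual_mean
# 425 / 6, roughly 70.83

# The built-in mean() gives the same result.
builtin_mean <- mean(x)
```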

Second Math Text written in LaTeX: Linear Regression Equation

\[ y = \beta_0 + \beta_1x_1 + \epsilon \]