library(ggplot2)
library(trelliscopejs)
## This package is no longer maintained. Please use the 'trelliscope' package instead (see https://github.com/trelliscope/).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
The goal of this graph is to see whether students are typically stronger in one subject (reading or math) or earn similar scores in both. The dataset contains math and reading scores, among many other variables. Each point on the scatterplot represents one student, and the plot as a whole shows how scores in the two subjects relate.
# read in data
students <- read.csv("StudentsPerformance.csv")
# plotting math and reading scores
plot(students$math.score, students$reading.score,
     main = "Math Scores vs. Reading Scores",
     xlab = "Math Score",
     ylab = "Reading Score",
     xlim = c(0, 100),
     ylim = c(0, 100),
     pch = 19,
     col = rgb(0, 0, 1, 0.25))
This scatterplot is very useful for comparing the two types of scores. There is a clear positive linear relationship between math and reading scores: as a student does better in one subject, they tend to do better in the other. In other words, most students are not strictly “math” students or “reading” students; they tend to perform well in both subjects or in neither. I was a bit surprised by this, since some people seem to be much better at one than the other. Overall, students are not typically stronger in one subject than the other, since they tend to score similarly in both.
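To put a rough number on this relationship, one quick check is the correlation and the slope of a simple linear fit. This is only a sketch, assuming the students data frame read in above with its math.score and reading.score columns.
# correlation between math and reading scores (assumes `students` from above)
cor(students$math.score, students$reading.score)
# intercept and slope of a simple linear fit of reading score on math score
coef(lm(reading.score ~ math.score, data = students))
A correlation close to 1 would support the “similar in both subjects” reading of the plot.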
This plot investigates the number of Covid deaths in March and April of 2020. The dataset contains Covid information (number of deaths, cases, etc.) from countries around the world, but I chose to focus on just the USA. The line plot shows how reported deaths change from day to day.
# read in data
covid <- read.csv("WHO-COVID-19-global-data.csv")
# converting date column
covid$Date_reported <- as.Date(covid$Date_reported, format = "%Y-%m-%d")
# subset for USA and March and April 2020
covidUSA <- covid[
  covid$Country == "United States of America" &
    format(covid$Date_reported, "%Y") == "2020" &
    (format(covid$Date_reported, "%m") == "03" |
       format(covid$Date_reported, "%m") == "04"), ]
# plotting deaths over time
ggplot(covidUSA, aes(Date_reported, New_deaths)) +
  geom_line(col = "navy") +
  scale_x_date(
    breaks = seq(min(covidUSA$Date_reported),
                 max(covidUSA$Date_reported),
                 by = "7 days"),
    date_labels = "%b %d") +
  labs(title = "Covid Deaths for March and April 2020 in the USA",
       x = "Date",
       y = "Number of Deaths")
As we can see, Covid deaths are at or near zero until around March 15. This makes sense, since a national emergency was declared on March 13. A line plot is very helpful for seeing how things change over time. After March 15, deaths continued to increase. For the month of April, there are some noticeable spikes and dips in the graph. After some research into these spikes, I concluded that the number of deaths was not truly varying this much; the spikes are an artifact of reporting. Hospitals have fewer staff on weekends, so some deaths were not recorded until Monday or Tuesday. In other words, this plot shows the number of deaths reported each day, not the true number of deaths that occurred each day.
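One way to check the weekend-reporting explanation is to average the reported deaths by day of the week. The sketch below is only illustrative and assumes the covidUSA subset created above; the exact weekday pattern depends on the data.
# average reported deaths by weekday for April 2020 (assumes `covidUSA` from above)
april <- covidUSA[format(covidUSA$Date_reported, "%m") == "04", ]
tapply(april$New_deaths, weekdays(april$Date_reported), mean)
If the reporting explanation holds, the weekend averages should be noticeably lower than the early-week averages.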
This next dataset records students’ habits along with their exam scores. I plan to compare study hours to exam scores to see whether spending more hours studying results in higher scores. I will then facet by gender to see if there are any differences between the groups.
# read in data
habits <- read.csv("student_habits_performance.csv")
# plotting study hours vs exam score, facet by gender
ggplot(habits, aes(study_hours_per_day, exam_score)) +
  geom_point(col = "blue", alpha = 0.5) +
  labs(x = "Study Hours (per day)",
       y = "Exam Score") +
  facet_trelliscope(~gender,
                    name = "Relationship Between Study Hours and Exam Scores",
                    desc = "Faceted by Gender",
                    nrow = 2, ncol = 2,
                    scales = c("same", "same"),
                    self_contained = TRUE,
                    path = ".")
## using data from the first layer
Since I wanted to compare study hours and exam scores, a scatterplot was the best choice. Based on the positive trend, as study hours increase, exam scores tend to increase as well. A few data points do not follow this trend, but the majority do. I also faceted by gender to see if one group studies more or scores higher on exams. Males and females have similar distributions of exam scores, with the data ranging from failing to 100%. They also have similar study hours, though females appear to study slightly more. It is also worth noting that I chose to ignore the “Other” category because it contained very few data points.
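A quick numeric summary backs up this visual comparison. This is just a sketch, assuming the habits data frame read in above and its gender, study_hours_per_day, and exam_score columns.
# mean study hours and exam score by gender (assumes `habits` from above)
habits %>%
  group_by(gender) %>%
  summarise(mean_study_hours = mean(study_hours_per_day),
            mean_exam_score = mean(exam_score),
            n = n())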
This final plot investigates the differences between estimated real estate values and actual sale prices. I animate the plot over years to see how the housing market has changed. It will also be interesting to see whether the different types of residential properties change in different ways.
# read in data
real_estate <- read.csv("V3.csv")
# taking out incomplete rows
real_estate <- real_estate %>%
  filter(!if_any(everything(), ~ is.na(.) | . == "?"))
# taking out values over 1 million
real_estate <- real_estate %>%
  filter(Estimated.Value <= 1000000 & Sale.Price <= 1000000)
# plotting estimated value vs sale price
plot_ly(
  data = real_estate,
  x = ~Estimated.Value,
  y = ~Sale.Price,
  frame = ~Year,
  color = ~Residential,
  colors = c("maroon", "blue", "darkgreen", "yellow3"),
  text = ~paste("Year:", Year,
                "<br>Estimated Value:", Estimated.Value,
                "<br>Sale Price:", Sale.Price,
                "<br>Type of Residence:", Residential),
  hoverinfo = "text",
  type = "scatter",
  mode = "markers",
  marker = list(opacity = 0.5)) %>%
  layout(
    title = "Estimated Value vs. Sale Price",
    xaxis = list(title = "Estimated Value"),
    yaxis = list(title = "Sale Price")) %>%
  animation_opts(frame = 2000, transition = 500, redraw = TRUE) %>%
  animation_slider(currentvalue = list(prefix = "Year: ",
                                       font = list(size = 24, color = "black")))
This scatterplot compares the estimated value of homes to their sale price. I filtered to properties where both values are under 1 million, since almost everything above that is unrealistic for the average buyer. The points are colored by the type of property: single family homes, duplexes, etc. Since most of the data comes from single family homes, that category shows the most variability. As expected, sale price increases with estimated value, but the relationship is not 1:1. That is, properties estimated at 0.2 million are not selling at 0.2 million, as we might expect; they actually sell above that value. Over the years this tendency to sell above the estimate stays constant, but its size changes: in some years properties sell well over their estimated value, and in other years only slightly over.
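To quantify how far sales run above the estimates in each year, one option is the median ratio of sale price to estimated value. This is a sketch under the same assumptions as above (the filtered real_estate data frame with its Year, Sale.Price, and Estimated.Value columns).
# median sale-to-estimate ratio by year (assumes the filtered `real_estate` from above)
real_estate %>%
  filter(Estimated.Value > 0) %>%          # guard against division by zero
  group_by(Year) %>%
  summarise(median_ratio = median(Sale.Price / Estimated.Value),
            n = n()) %>%
  arrange(Year)
A ratio above 1 in a given year means that the typical property sold for more than its estimated value that year.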