Data used from DS Labs - Stars.
library(stringi)
stri_rand_lipsum(1)
## [1] "Lorem ipsum dolor sit amet, in ut ac etiam varius vehicula libero, sed molestie inceptos eros. Facilisis, non montes dolor iaculis pellentesque odio suscipit vestibulum. Vel purus morbi. Praesent posuere, sociis augue sed donec in. Sed egestas torquent dapibus vestibulum velit suscipit commodo taciti! Arcu pulvinar per commodo tincidunt diam augue aliquam! Sociosqu quisque praesent non ut sit. Faucibus in congue aptent, lacus sed dolor in."
# Pending intro text
Packages <- c("tidyverse","dplyr","ggthemes","dslabs","scales","grid","gridExtra")
invisible(lapply(Packages, library, character.only = TRUE))
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##
## 다음의 패키지를 부착합니다: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
##
## 다음의 패키지를 부착합니다: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-death_prob.R"
## [5] "make-divorce_margarine.R"
## [6] "make-gapminder-rdas.R"
## [7] "make-greenhouse_gases.R"
## [8] "make-historic_co2.R"
## [9] "make-mnist_27.R"
## [10] "make-movielens.R"
## [11] "make-murders-rda.R"
## [12] "make-na_example-rda.R"
## [13] "make-nyc_regents_scores.R"
## [14] "make-olive.R"
## [15] "make-outlier_example.R"
## [16] "make-polls_2008.R"
## [17] "make-polls_us_election_2016.R"
## [18] "make-reported_heights-rda.R"
## [19] "make-research_funding_rates.R"
## [20] "make-stars.R"
## [21] "make-temp_carbon.R"
## [22] "make-tissue-gene-expression.R"
## [23] "make-trump_tweets.R"
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
data("stars")
Pending summary of EDA process
We check min and max values for graphing needs as well as check for NA values.
summary(stars)
## star magnitude temp type
## Altair : 2 Min. :-8.000 Min. : 2500 Length:96
## *40EridaniA: 1 1st Qu.:-1.800 1st Qu.: 3168 Class :character
## *40EridaniB: 1 Median : 2.400 Median : 5050 Mode :character
## *40EridaniC: 1 Mean : 4.257 Mean : 8752
## *61CygniA : 1 3rd Qu.:11.325 3rd Qu.: 9900
## *61CygniB : 1 Max. :17.000 Max. :33600
## (Other) :89
colSums(is.na(stars))
## star magnitude temp type
## 0 0 0 0
Pending summary of data manipulation
Here we add a grouping column. The purpose of this is to add a layer to our graphs as it relates to a general category of temperatures.
stars <- stars[order(stars$temp),]
stars <- stars %>%
mutate(groupTemp = ifelse(temp < 4000, "Less than 4,000",
ifelse(temp >= 4000 & temp < 6000, "Between 4,000 and 6,000",
ifelse(temp >= 6000 & temp < 8000, "Between 6,000 and 8,000",
ifelse(temp >= 8000 & temp < 10000, "Between 8,000 and 10,000",
ifelse(temp >= 10000, "Greater than 10,000", NA))))))
Comments describing all chunks of code Meaningful labels for X and y axes Meaningful title A theme for the graph Colors for a third variable with a legend
This initial graph represents the standard ggplot graph of variables Magnitude and Temperature.
ggplot(stars, aes(x = temp, y = magnitude, color = groupTemp)) +
xlab("Temperature of the Stars") +
ylab("Magnitude (Measure of Brightness)") +
ggtitle("Star's Temperature as Compared to its Brightness") +
geom_point(size = 2,alpha = .8) +
geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
theme(axis.text.x = element_text(angle = 45))
## `geom_smooth()` using formula 'y ~ x'
As we can see, there are a significant bunching of data points under 10,000K. Using scaling in log10, we will attempt to clear and separate this information.
ggplot(stars, aes(x = temp, y = magnitude, color = groupTemp)) +
scale_x_continuous(
name = "Temperature of the Stars",
limits = c(2450,35000),
expand = expansion(0),
breaks = c(2450, 3000, 3500, 4000, 6000, 8000, 10000, 20000,30000),
trans = "log10"
) +
ylab("Magnitude (Measure of Brightness)") +
ggtitle("Star's Temperature as Compared to its Brightness") +
geom_point(size = 2,alpha = .8) +
geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
theme(axis.text.x = element_text(angle = 45))
## `geom_smooth()` using formula 'y ~ x'
Unfortunately, it still seems there is a significant bunch under 6000 that would need further clearing. On plot3, we attempt to break the temperature into smaller values to see its affects of log transformation.
ggplot(stars, aes(x = temp/1000, y = magnitude, color = groupTemp)) +
scale_x_continuous(
name = "Temperature of the Stars in Thousands (log10)",
expand = expansion(0),
trans = "log10"
) +
ylab("Magnitude (Measure of Brightness)") +
ggtitle("Star's Temperature as Compared to its Brightness") +
geom_point(size = 2,alpha = .8) +
geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
theme(axis.text.x = element_text(angle = 45))
## `geom_smooth()` using formula 'y ~ x'
This has no affect to any of the data points. In our last attempt, we look to split up the data into two major sections (<10000 or >10000) and combine the two plots to manually scale and manipulate the x-axis.
p1 <- ggplot(stars[stars$temp<10000,], aes(x = temp, y = magnitude, color = groupTemp)) +
scale_x_continuous(
name = "",
expand = expansion(0),
trans = "log10",
limits=c(2500,10000),
breaks=c(2500,3000,3500,4000,5000,6000,8000,10000)
) +
scale_y_continuous(limits=c(-10,20)) +
coord_cartesian(xlim=c(2500,10050), ylim=c(-10,22), expand=FALSE) +
ylab("Magnitude (Measurement of Brightness)") +
geom_point(size = 2,alpha = .8) +
#geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
#scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
theme(axis.text.x = element_text(angle = 90),
plot.margin=unit(c(0,-0.2,0,0),"lines"),
legend.position="none")
p2 <-ggplot(stars[stars$temp>=10000,], aes(x = temp, y = magnitude)) +
scale_x_continuous(
name = "",
expand = expansion(0),
limits=c(10000,35000),
breaks=seq(12500,35000,2500)
) +
scale_y_continuous(limits=c(-10,20)) +
coord_cartesian(xlim=c(10000,37000), ylim=c(-10,22), expand=FALSE) +
#ylab("") +
geom_point(size = 2,alpha = .8, color="blue3") +
#geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
#scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
theme(axis.text.x = element_text(angle = 90),
plot.margin=unit(c(0,0,0,-0.2),"lines"),
axis.title.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank())
grid.arrange(arrangeGrob(p1,p2, widths=c(.7,.3),ncol=2),
textGrob("Temperature of the Stars (in K)"),ncol = 1, heights=c(.95,.05))
Though our attempt to completely split out chunks of data points wasn’t 100% successful, what we did get from this exercise was the ability to pull data points in what appears to be a bit more of a linear model when viewed in the last graph, but is more of an negative exponential when sorted in the other graphs.
The purpose of this review was to see a comparison of the brightness a star emits as compared to its temperature. From the measured stars, there exists many outliers from each of the groups. Though we found a general correlation between temperature and the measured brightness (in orders of Magnitude), there was a significant amount of data leaning towards the cooler temperatures. In this exercise we attempt to provide better clarification of these data points by allowing it to be shown in a better format to expand on graphing techniques. Though, this method brings up the question of graphical data ethics by manipulating the axis scales significantly, we have bent the natural curvature of the points.
With additional time, I would like to be able to continue working on this graph to clear up the half circles where the plots were split, find a better scaling solution with as little disruption to the shape of the curve as possible.