DSLabs Homework

Sulchan Yoon

Data used from DS Labs - Stars.

library(stringi)
stri_rand_lipsum(1)

## [1] "Lorem ipsum dolor sit amet, in ut ac etiam varius vehicula libero, sed molestie inceptos eros. Facilisis, non montes dolor iaculis pellentesque odio suscipit vestibulum. Vel purus morbi. Praesent posuere, sociis augue sed donec in. Sed egestas torquent dapibus vestibulum velit suscipit commodo taciti! Arcu pulvinar per commodo tincidunt diam augue aliquam! Sociosqu quisque praesent non ut sit. Faucibus in congue aptent, lacus sed dolor in."

# Pending intro text

Packages <- c("tidyverse","dplyr","ggthemes","dslabs","scales","grid","gridExtra")
invisible(lapply(Packages, library, character.only = TRUE))

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## 
## 다음의 패키지를 부착합니다: 'scales'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## 
## 
## 다음의 패키지를 부착합니다: 'gridExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

data(package="dslabs")
list.files(system.file("script", package = "dslabs"))

##  [1] "make-admissions.R"                   
##  [2] "make-brca.R"                         
##  [3] "make-brexit_polls.R"                 
##  [4] "make-death_prob.R"                   
##  [5] "make-divorce_margarine.R"            
##  [6] "make-gapminder-rdas.R"               
##  [7] "make-greenhouse_gases.R"             
##  [8] "make-historic_co2.R"                 
##  [9] "make-mnist_27.R"                     
## [10] "make-movielens.R"                    
## [11] "make-murders-rda.R"                  
## [12] "make-na_example-rda.R"               
## [13] "make-nyc_regents_scores.R"           
## [14] "make-olive.R"                        
## [15] "make-outlier_example.R"              
## [16] "make-polls_2008.R"                   
## [17] "make-polls_us_election_2016.R"       
## [18] "make-reported_heights-rda.R"         
## [19] "make-research_funding_rates.R"       
## [20] "make-stars.R"                        
## [21] "make-temp_carbon.R"                  
## [22] "make-tissue-gene-expression.R"       
## [23] "make-trump_tweets.R"                 
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"

data("stars")

EDA

Pending summary of EDA process

We check min and max values for graphing needs as well as check for NA values.

summary(stars)

##           star      magnitude           temp           type          
##  Altair     : 2   Min.   :-8.000   Min.   : 2500   Length:96         
##  *40EridaniA: 1   1st Qu.:-1.800   1st Qu.: 3168   Class :character  
##  *40EridaniB: 1   Median : 2.400   Median : 5050   Mode  :character  
##  *40EridaniC: 1   Mean   : 4.257   Mean   : 8752                     
##  *61CygniA  : 1   3rd Qu.:11.325   3rd Qu.: 9900                     
##  *61CygniB  : 1   Max.   :17.000   Max.   :33600                     
##  (Other)    :89

colSums(is.na(stars))

##      star magnitude      temp      type 
##         0         0         0         0

Manipulation

Pending summary of data manipulation

Here we add a grouping column. The purpose of this is to add a layer to our graphs as it relates to a general category of temperatures.

stars <- stars[order(stars$temp),]
stars <- stars %>%
  mutate(groupTemp = ifelse(temp < 4000, "Less than 4,000",
                            ifelse(temp >= 4000 & temp < 6000, "Between 4,000 and 6,000",
                            ifelse(temp >= 6000 & temp < 8000, "Between 6,000 and 8,000",
                            ifelse(temp >= 8000 & temp < 10000, "Between 8,000 and 10,000",
                            ifelse(temp >= 10000, "Greater than 10,000", NA))))))

Graph

Rules:

Comments describing all chunks of code Meaningful labels for X and y axes Meaningful title A theme for the graph Colors for a third variable with a legend

This initial graph represents the standard ggplot graph of variables Magnitude and Temperature.

ggplot(stars, aes(x = temp, y = magnitude, color = groupTemp)) +
  xlab("Temperature of the Stars") +
  ylab("Magnitude (Measure of Brightness)") +
  ggtitle("Star's Temperature as Compared to its Brightness") +
  geom_point(size = 2,alpha = .8) +
  geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
  scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using formula 'y ~ x'

As we can see, there are a significant bunching of data points under 10,000K. Using scaling in log10, we will attempt to clear and separate this information.

ggplot(stars, aes(x = temp, y = magnitude, color = groupTemp)) +
  scale_x_continuous(
    name = "Temperature of the Stars",
    limits = c(2450,35000),
    expand = expansion(0),
    breaks = c(2450, 3000, 3500, 4000, 6000, 8000, 10000, 20000,30000),
    trans = "log10"
  ) +
  ylab("Magnitude (Measure of Brightness)") +
  ggtitle("Star's Temperature as Compared to its Brightness") +
  geom_point(size = 2,alpha = .8) +
  geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
  scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using formula 'y ~ x'

Unfortunately, it still seems there is a significant bunch under 6000 that would need further clearing. On plot3, we attempt to break the temperature into smaller values to see its affects of log transformation.

ggplot(stars, aes(x = temp/1000, y = magnitude, color = groupTemp)) +
  scale_x_continuous(
    name = "Temperature of the Stars in Thousands (log10)",
    expand = expansion(0),
    trans = "log10"
  ) +
  ylab("Magnitude (Measure of Brightness)") +
  ggtitle("Star's Temperature as Compared to its Brightness") +
  geom_point(size = 2,alpha = .8) +
  geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
  scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
  theme(axis.text.x = element_text(angle = 45))

## `geom_smooth()` using formula 'y ~ x'

This has no affect to any of the data points. In our last attempt, we look to split up the data into two major sections (<10000 or >10000) and combine the two plots to manually scale and manipulate the x-axis.

p1 <- ggplot(stars[stars$temp<10000,], aes(x = temp, y = magnitude, color = groupTemp)) +
  scale_x_continuous(
    name = "",
    expand = expansion(0),
    trans = "log10",
    limits=c(2500,10000),
    breaks=c(2500,3000,3500,4000,5000,6000,8000,10000)
  ) +
  scale_y_continuous(limits=c(-10,20)) +
  coord_cartesian(xlim=c(2500,10050), ylim=c(-10,22), expand=FALSE) +
  ylab("Magnitude (Measurement of Brightness)") +
  geom_point(size = 2,alpha = .8) +
  #geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
  #scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
  theme(axis.text.x = element_text(angle = 90),
        plot.margin=unit(c(0,-0.2,0,0),"lines"),
        legend.position="none")

p2 <-ggplot(stars[stars$temp>=10000,], aes(x = temp, y = magnitude)) +
  scale_x_continuous(
    name = "",
    expand = expansion(0),
    limits=c(10000,35000),
    breaks=seq(12500,35000,2500)
  ) +
  scale_y_continuous(limits=c(-10,20)) +
  coord_cartesian(xlim=c(10000,37000), ylim=c(-10,22), expand=FALSE) +
  #ylab("") +
  geom_point(size = 2,alpha = .8, color="blue3") +
  #geom_smooth(method=lm, se = FALSE, lty = 2, size = 0.3) +
  #scale_color_brewer(name = "Grouping of Temperatures", palette = "Set3") +
  theme(axis.text.x = element_text(angle = 90),
        plot.margin=unit(c(0,0,0,-0.2),"lines"),
        axis.title.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank())

grid.arrange(arrangeGrob(p1,p2, widths=c(.7,.3),ncol=2),
             textGrob("Temperature of the Stars (in K)"),ncol = 1, heights=c(.95,.05))

Final Summary

Though our attempt to completely split out chunks of data points wasn’t 100% successful, what we did get from this exercise was the ability to pull data points in what appears to be a bit more of a linear model when viewed in the last graph, but is more of an negative exponential when sorted in the other graphs.

The purpose of this review was to see a comparison of the brightness a star emits as compared to its temperature. From the measured stars, there exists many outliers from each of the groups. Though we found a general correlation between temperature and the measured brightness (in orders of Magnitude), there was a significant amount of data leaning towards the cooler temperatures. In this exercise we attempt to provide better clarification of these data points by allowing it to be shown in a better format to expand on graphing techniques. Though, this method brings up the question of graphical data ethics by manipulating the axis scales significantly, we have bent the natural curvature of the points.

With additional time, I would like to be able to continue working on this graph to clear up the half circles where the plots were split, find a better scaling solution with as little disruption to the shape of the curve as possible.

DSLabs

Sulchan Yoon

2022-10-25

DSLabs Homework

Sulchan Yoon

EDA

Manipulation

Graph

Rules:

Final Summary