DS Labs Assignment: Star’s Magnitude: Investigating Incremental Levels

Author

N Bellot Norman

Published

June 23, 2024

Load the dslabs package and list its available datasets

library("dslabs")
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))

 [1] "make-admissions.R"                   
 [2] "make-brca.R"                         
 [3] "make-brexit_polls.R"                 
 [4] "make-calificaciones.R"               
 [5] "make-death_prob.R"                   
 [6] "make-divorce_margarine.R"            
 [7] "make-gapminder-rdas.R"               
 [8] "make-greenhouse_gases.R"             
 [9] "make-historic_co2.R"                 
[10] "make-mice_weights.R"                 
[11] "make-mnist_127.R"                    
[12] "make-mnist_27.R"                     
[13] "make-movielens.R"                    
[14] "make-murders-rda.R"                  
[15] "make-na_example-rda.R"               
[16] "make-nyc_regents_scores.R"           
[17] "make-olive.R"                        
[18] "make-outlier_example.R"              
[19] "make-polls_2008.R"                   
[20] "make-polls_us_election_2016.R"       
[21] "make-pr_death_counts.R"              
[22] "make-reported_heights-rda.R"         
[23] "make-research_funding_rates.R"       
[24] "make-stars.R"                        
[25] "make-temp_carbon.R"                  
[26] "make-tissue-gene-expression.R"       
[27] "make-trump_tweets.R"                 
[28] "make-weekly_us_contagious_diseases.R"
[29] "save-gapminder-example-csv.R"

Load the selected dataset and the additional packages needed for the analysis

data("stars")
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggfortify)
library(htmltools)
library(ggthemes)
library(ggrepel)
library(readr)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Write the dataset to a CSV file

write_csv(stars, "stars.csv", na="")

Get working directory

getwd()

[1] "C:/Users/naomi/OneDrive/Desktop/Desktop of 11-08-2022/Community College Classes/DATA 110/DS Lab"

Read the selected data as a csv file

stars <- read_csv("stars.csv")

Rows: 96 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): star, type
dbl (2): magnitude, temp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
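The message above suggests specifying the column types. A minimal sketch of that idea follows, assuming a small three-row sample copied from the head() output in this document and a temporary file; the names sample_stars, csv_path, star_types, and round_trip are illustrative and not part of the original analysis.

```r
library(readr)

# Hypothetical three-row sample; values copied from the head() output in this document
sample_stars <- data.frame(
  star = c("Sun", "SiriusA", "Canopus"),
  magnitude = c(4.8, 1.4, -3.1),
  temp = c(5840, 9620, 7400),
  type = c("G", "A", "F")
)

# Write to a temporary file so the real stars.csv is left untouched
csv_path <- tempfile(fileext = ".csv")
write_csv(sample_stars, csv_path, na = "")

# Declaring the column types up front documents the expected schema
# and quiets the column specification message
star_types <- cols(
  star = col_character(),
  magnitude = col_double(),
  temp = col_double(),
  type = col_character()
)
round_trip <- read_csv(csv_path, col_types = star_types)
round_trip
```

Declaring the types also means a later change in the file, such as text appearing in the temp column, is reported as a parsing problem instead of silently changing the column type.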

Review the first few rows of the data

head(stars)

# A tibble: 6 × 4
  star           magnitude  temp type 
  <chr>              <dbl> <dbl> <chr>
1 Sun                  4.8  5840 G    
2 SiriusA              1.4  9620 A    
3 Canopus             -3.1  7400 F    
4 Arcturus            -0.4  4590 K    
5 AlphaCentauriA       4.3  5840 G    
6 Vega                 0.5  9900 A

Categorize the stars and create a new column using case_when() conditions

stars <- stars %>%
  mutate(stars = case_when(
    # Conditions are checked in order; the first one that matches sets the category
    temp > 6000 & magnitude < 1 ~ "Dwarfs",
    temp < 6000 & magnitude > 10 ~ "Giants",
    # This condition never matches here because magnitude does not exceed 100 in this dataset
    temp < 10000 & magnitude > 100 ~ "Supergiants",
    TRUE ~ "Super Stars" # every remaining star
  ))
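To see how the ordered conditions behave, the sketch below applies the same rules to six rows copied from the head() and tail() output in this document. The names sample_stars, labelled, and category are illustrative; this is a check on a small sample, not a replacement for the full analysis.

```r
library(dplyr)

# Six rows copied from the head() and tail() output in this document
sample_stars <- tibble(
  star      = c("Sun", "SiriusA", "Canopus", "Vega", "*40EridaniB", "*40EridaniC"),
  magnitude = c(4.8, 1.4, -3.1, 0.5, 11.1, 12.8),
  temp      = c(5840, 9620, 7400, 9900, 10000, 2940)
)

# case_when() tests each row against the conditions from top to bottom
# and assigns the label of the first condition that is TRUE
labelled <- sample_stars %>%
  mutate(category = case_when(
    temp > 6000 & magnitude < 1    ~ "Dwarfs",
    temp < 6000 & magnitude > 10   ~ "Giants",
    temp < 10000 & magnitude > 100 ~ "Supergiants",
    TRUE                           ~ "Super Stars"
  ))

# Tally how many rows landed in each category
count(labelled, category)
```

On this sample, no row receives the Supergiants label, because no magnitude exceeds 100; rows that match none of the first three conditions fall through to Super Stars.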

Review updated dataset

head(stars)

# A tibble: 6 × 5
  star           magnitude  temp type  stars      
  <chr>              <dbl> <dbl> <chr> <chr>      
1 Sun                  4.8  5840 G     Super Stars
2 SiriusA              1.4  9620 A     Super Stars
3 Canopus             -3.1  7400 F     Dwarfs     
4 Arcturus            -0.4  4590 K     Super Stars
5 AlphaCentauriA       4.3  5840 G     Super Stars
6 Vega                 0.5  9900 A     Dwarfs

Review the last few rows of the data

tail(stars)

# A tibble: 6 × 5
  star         magnitude  temp type  stars      
  <chr>            <dbl> <dbl> <chr> <chr>      
1 *40EridaniA        6    4900 K     Super Stars
2 *40EridaniB       11.1 10000 DA    Super Stars
3 *40EridaniC       12.8  2940 M     Giants     
4 *70OphiuchiA       5.8  4950 K     Super Stars
5 *70OphiuchiB       7.5  3870 K     Super Stars
6 EVLacertae        11.7  2800 M     Giants

Build the scatterplot with temp on the x-axis, magnitude on the y-axis, and the new category mapped to color to differentiate the star groups

p <- ggplot(stars, aes(x = temp, y = magnitude, color = stars)) +
  geom_point(size = 3) + # size of the plotted data points
  geom_hline(yintercept = 0, linetype = "dashed", color = "purple") + # dashed reference line at magnitude 0, separating negative from positive values
  labs(title = "Star's Magnitude:\nInvestigating Incremental Levels", # "\n" breaks the long title onto two lines
       x = "Temperature (K)", # x-axis label
       y = "Magnitude", # y-axis label
       color = "stars") + # legend title for the new category
  theme_minimal() + # use a different style from the default theme
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), # centered, bold title
    axis.title.x = element_text(size = 12, face = "bold"), # bold x-axis label
    axis.title.y = element_text(size = 12, face = "bold"), # bold y-axis label
    legend.position = "top" # place the legend at the top
  )

Create a point-size variable based on magnitude

stars <- stars %>%
  mutate(temp_Size = case_when(
    magnitude > 14 ~ 10, # size 10 when magnitude is greater than 14
    magnitude > 10 ~ 3, # size 3 when magnitude is greater than 10 (and at most 14)
    TRUE ~ 1 # size 1 for all other values
  ))

Calculate the Correlation Coefficient to analyze the relationship between variables

cor_coefficient <- cor(stars$temp, stars$magnitude)
print(paste("Correlation Coefficient: ", cor_coefficient))

[1] "Correlation Coefficient:  -0.633190799835397"

Calculate the adjusted R-squared to determine how much of the variation in the dependent variable (temp) is explained by the independent variable (magnitude)

fit <- lm(temp ~ magnitude, data = stars) # fit a linear model; use a new name so the stars data frame is not overwritten
summary_stars <- summary(fit)
adj_r_squared <- summary_stars$adj.r.squared # adjusted R-squared from the model summary
print(paste("Adjusted R-squared: ", adj_r_squared))

[1] "Adjusted R-squared:  0.394557510155724"
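For a linear model with a single predictor, R-squared equals the squared correlation coefficient, so the adjusted R-squared printed above can be cross-checked from numbers already shown in this document. The sketch below assumes n = 96 (the row count reported by read_csv) and one predictor; the variable names are illustrative.

```r
# Values copied from the output in this document
r <- -0.633190799835397   # correlation between temp and magnitude
n <- 96                   # number of rows reported by read_csv()
k <- 1                    # number of predictors in lm(temp ~ magnitude)

# With one predictor, R-squared is the square of the correlation
r_squared <- r^2

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared_check <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

round(r_squared, 4)
round(adj_r_squared_check, 4)
```

The second value should agree with the adjusted R-squared reported by summary(), about 0.3946.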

Convert ggplot into Plotly

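p_plotly <- ggplotly(p) %>% 
  layout(
    legend = list(
      x = 0.8, # horizontal legend position, as a fraction of the plot width
      y = 0.9, # vertical legend position, as a fraction of the plot height
      bgcolor = "rgba(255, 255, 255, 0.5)" # semi-transparent white legend background
    ))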

Run Plotly Output (Interactive Visualization)

p_plotly

Data Selection and Visualization

From the DS Labs package, I selected the “stars” dataset. The x-axis displays temperature, ranging from 2500 to 33600 Kelvin. The y-axis represents magnitude, which ranges from -8 to 17. The original “type” column has many designations, which could be confusing for a reader without astronomy training. I therefore simplified it by creating a new column labeled ‘stars’ that groups the data by temperature and magnitude into three levels: Dwarfs, Giants, and Super Stars. A fourth level, Supergiants, is defined in the code, but no star meets its condition. The names are adapted from astronomical naming conventions (https://observatory.astro.utah.edu/Stars.html).

Plotly is my preferred tool for interactive visualization. It lets the user hover over each point to see its values. For example, hovering over the red point in the lower right-hand corner shows the following information: temp: 33600, magnitude: -5.9, stars: Dwarfs.

Correlation Coefficient & Adjusted R-Squared

The correlation coefficient is -0.633, which means the relationship between these two numerical variables is negative and moderate. The adjusted R-squared is 0.395, which means that magnitude explains only about 39.5% of the variation in temperature; other factors account for the rest.

Analyzing the Graph

The red points are the hotter stars, with temperatures above 6,000. Their magnitudes are below 1, so they sit near or below the dashed line at zero. Because lower magnitude values indicate brighter stars, these are bright stars, not dim ones, even though this analysis labels them Dwarfs. The group spreads widely along the horizontal (temperature) axis.

The blue points are generally cooler, with temperatures mostly below 10,000. A few have temperatures above 10,000, and in those cases the magnitude is also much higher. These characteristics describe the Super Stars group, which also spans a wide vertical range.

The green points represent magnitudes over 10 and low temperatures, within the 2000 to 3000 range. On the magnitude scale, higher values mean fainter stars, so these stars are cool and faint, not bright. This analysis labels them Giants. Within this group, the lower the temperature and the higher the magnitude, the cooler and fainter the star.