##Question 1

library(readr)

STAR <- read_csv("STAR.csv")
## Rows: 1274 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): classtype
## dbl (3): reading, math, graduated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(STAR)
## # A tibble: 6 × 4
##   classtype reading  math graduated
##   <chr>       <dbl> <dbl>     <dbl>
## 1 small         578   610         1
## 2 regular       612   612         1
## 3 regular       583   606         1
## 4 small         661   648         1
## 5 small         614   636         1
## 6 regular       610   603         0
summary(STAR)
##   classtype            reading           math         graduated     
##  Length:1274        Min.   :527.0   Min.   :515.0   Min.   :0.0000  
##  Class :character   1st Qu.:602.0   1st Qu.:604.0   1st Qu.:1.0000  
##  Mode  :character   Median :629.0   Median :631.0   Median :1.0000  
##                     Mean   :628.8   Mean   :631.6   Mean   :0.8697  
##                     3rd Qu.:654.0   3rd Qu.:659.0   3rd Qu.:1.0000  
##                     Max.   :753.0   Max.   :774.0   Max.   :1.0000

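The variables are summarized differently because they are stored as different types. Numeric variables (reading and math) measure quantities, so summary() reports the minimum, first quartile, median, mean, third quartile, and maximum, which together describe the center and spread of the distribution. classtype is a character variable that holds category labels, so summary() reports only its length and class; frequency counts are the more useful summary for it. graduated is a binary 0/1 indicator that R stores as a number, so it gets the numeric summary, and its mean (0.8697) is the proportion of students who graduated.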
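
As a minimal sketch of the point about types, the six rows printed by head(STAR) can be retyped into a small data frame and summarized by hand. The name star_sample is made up for this illustration, and six rows are far too few to say anything about the full dataset.

```r
# Illustrative sample: the six rows printed by head(STAR), retyped by hand.
# star_sample is a hypothetical name, not an object from the assignment.
star_sample <- data.frame(
  classtype = c("small", "regular", "regular", "small", "small", "regular"),
  reading   = c(578, 612, 583, 661, 614, 610),
  math      = c(610, 612, 606, 648, 636, 603),
  graduated = c(1, 1, 1, 1, 1, 0)
)

# Frequency counts are the natural summary for a character variable.
class_counts <- table(star_sample$classtype)
print(class_counts)

# The mean of a 0/1 variable is the share of 1s (here, the share who graduated).
grad_share <- mean(star_sample$graduated)
print(grad_share)
```

On the full data, the same two calls would be table(STAR$classtype) and mean(STAR$graduated).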

##Question 2

hist(STAR$math,main="Distribution of Math Scores",xlab ="Math Score", ylab = "Frequency")

The histogram shows the distribution of math scores in the STAR dataset, with the math score on the x-axis and the frequency on the y-axis. The scores form a roughly bell-shaped curve: most students scored near the middle, and few had very low or very high scores. The main argument sets the chart title, while xlab and ylab set the axis labels.
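
To see what the bars of a histogram are made of, hist() can be asked to return the bin counts instead of drawing. This is only a sketch: it uses the six math scores from head(STAR), and both the name math_sample and the bin edges are chosen for illustration.

```r
# Illustrative sample: the six math scores printed by head(STAR).
math_sample <- c(610, 612, 606, 648, 636, 603)

# plot = FALSE returns the bins and counts without drawing anything.
# The bin edges (600, 610, ..., 650) are an assumption made for this example.
h <- hist(math_sample, breaks = seq(600, 650, by = 10), plot = FALSE)

print(h$breaks)  # the bin edges
print(h$counts)  # how many scores fall in each bin
```

Each count is the height of one bar, so the shape of the histogram is just these counts laid side by side.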

##Question 3

boxplot(STAR$reading, main="Distribution of Reading Scores")

The boxplot shows the distribution of reading scores in the STAR dataset. The distribution looks balanced, with most scores concentrated around the median, and the circle at the top marks an outlier. Boxplots are useful for spotting outliers, including possible data-entry errors that are worth double-checking.
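
The circle on a boxplot can also be listed by value. A minimal sketch, assuming the default rule of 1.5 times the box height that boxplot() uses: boxplot.stats() returns the points beyond the whiskers. The vector below is made up for illustration (six reading scores from head(STAR) plus the maximum of 753 from the summary), so it is not the real distribution.

```r
# Illustrative vector: six reading scores from head(STAR) plus the maximum (753).
# reading_sample is a hypothetical name used only in this example.
reading_sample <- c(578, 612, 583, 661, 614, 610, 753)

# boxplot.stats() applies the same rule boxplot() uses to draw the circles.
stats <- boxplot.stats(reading_sample)
outliers <- stats$out
print(outliers)
```

On the full data, boxplot.stats(STAR$reading)$out would list the values to double-check.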

##Question 4

plot(STAR$math, STAR$reading, main="Academic Scores")

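##Remember that capitalization matters: STAR is not the same as star, and using the wrong one gives an error.
##The first argument (STAR$math) goes on the x-axis and the second (STAR$reading) goes on the y-axis, as in plot(x, y). Without xlab and ylab, the axis labels default to those expressions.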
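
A minimal sketch of the same scatterplot with explicit axis labels, using the six rows from head(STAR). The name star_sample and the label text are assumptions made for this example.

```r
# Illustrative sample: the math and reading scores printed by head(STAR).
star_sample <- data.frame(
  math    = c(610, 612, 606, 648, 636, 603),
  reading = c(578, 612, 583, 661, 614, 610)
)

# Draw to a null device so the sketch runs without opening a window;
# drop pdf(NULL) and dev.off() in an interactive session.
pdf(NULL)
plot(star_sample$math, star_sample$reading,  # first argument is x, second is y
     main = "Academic Scores",
     xlab = "Math Score",                    # replaces the default x-axis label
     ylab = "Reading Score")                 # replaces the default y-axis label
invisible(dev.off())
```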

##Question 5

total<-STAR$math+STAR$reading
summary(total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1044    1213    1257    1260    1309    1476
hist(total,main="Student Reading and Math Scores")

##Question 6

grad_yes <- STAR$math[STAR$graduated == 1]
grad_no <- STAR$math[STAR$graduated == 0]

This creates the vectors grad_yes and grad_no. grad_yes holds the math scores of the students who graduated, and grad_no holds the math scores of the students who did not. The 1 and 0 match the binary coding of graduated in the dataset. Now I can use mean() on each vector to answer the question.
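
The subsetting step can be sketched on the six rows from head(STAR). The comparison graduated == 1 produces a TRUE/FALSE vector, and putting it inside the square brackets keeps only the scores where it is TRUE. The names ending in _sample are made up for this illustration.

```r
# Illustrative sample: math scores and graduation status from head(STAR).
math_sample      <- c(610, 612, 606, 648, 636, 603)
graduated_sample <- c(1, 1, 1, 1, 1, 0)

# The comparison gives one TRUE/FALSE per student.
is_grad <- graduated_sample == 1
print(is_grad)

# Indexing with the logical vector keeps only the matching scores.
grad_yes_sample <- math_sample[is_grad]
grad_no_sample  <- math_sample[!is_grad]
print(grad_yes_sample)
print(grad_no_sample)
```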

mean(grad_yes)
## [1] 635.3258
mean(grad_no)
## [1] 606.6386

The mean math score is 635.3258 for students who graduated and 606.6386 for students who did not, so students who graduated scored about 28.7 points higher on average.
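
As an alternative sketch, tapply() computes both group means in one call, and the gap is then a subtraction. This uses the six rows from head(STAR) under made-up names, so the numbers it prints will not match the full-data means above.

```r
# Illustrative sample: math scores and graduation status from head(STAR).
math_sample      <- c(610, 612, 606, 648, 636, 603)
graduated_sample <- c(1, 1, 1, 1, 1, 0)

# tapply() splits the scores by group and applies mean() to each group.
# The result is named by group value: "0" and "1".
group_means <- tapply(math_sample, graduated_sample, mean)
print(group_means)

# Difference in mean math score: graduated minus not graduated.
gap <- group_means[["1"]] - group_means[["0"]]
print(gap)
```

On the full data, the equivalent call would be tapply(STAR$math, STAR$graduated, mean).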