STATS-HW-3.knit

Title: Homework 3

Author: Brandon Flores

Date: Sept. 15th, 2021

The QQ plot shown below indicates a rightward skew in the annual_salary variable of the salaries data set. A sharp spike appears near the 2nd theoretical quantile, which indicates a cluster of identical salary values. From quantile -3 to somewhat past -2 the plot is flat, which reflects a concentration of values at the low end, where the distribution peaks. At the far right of the plot there are large gaps between points, with only a few widely dispersed points that still follow the rightward skew. These points are the outliers in the distribution.

When comparing the two QQ plots, you can again see that the salary data do not fall on a straight line, with much more of a "disruption" in the data points after quantile 2. Compared with the QQ plot of the simulated values, qqnorm(y), which represents a normal distribution, the QQ plot of the salary data shows that the salaries are not normally distributed.
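As a rough illustration of the comparison described above, the sketch below draws normal QQ plots for a simulated normal sample and a simulated right-skewed sample. The log-normal parameters and the seed are assumptions chosen for illustration, not values taken from the salary data; the mean and standard deviation of the normal sample are rounded from the simulation in the code section of this report.

```r
# Simulated data only: these are not the salary values from the report.
set.seed(1)  # arbitrary seed, for reproducibility
normal_sample <- rnorm(2225, mean = 63822, sd = 36170)
skewed_sample <- rlnorm(2225, meanlog = 10.9, sdlog = 0.5)  # hypothetical right-skewed stand-in

op <- par(mfrow = c(1, 2))  # two plots side by side
qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample)       # reference line through the quartiles
qqnorm(skewed_sample, main = "Right-skewed sample")
qqline(skewed_sample)
par(op)                     # restore the previous plotting parameters
```

Points from the normal sample should hug the reference line, while the right-skewed sample should bend upward away from it at the high end, which is the pattern described for the salary data.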

The normal probability plot and the two histograms lead to the same interpretation as the QQ plot: a rightward skew with the peak of the distribution at the far left end of the data. There is a sharp decline after $100,000, followed by a long, thin tail of data points that lie far from the peak. This dense concentration of salaries is easier to see in the histogram of the simulated normal values, which focuses on that part of the range. The salaries are most heavily concentrated between roughly $50K and $75K.

In the frequency table of the data, 73.5% of salaries are under $75,000, 21.3% are between $75K and $125K, and 5.26% are over $125,000. This again shows how the data are distributed, with the bulk under $75K and a handful of outliers over $125K.
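The same kind of frequency table can be sketched with base R's cut(), which makes the handling of the bin edges explicit. The toy salary vector below is hypothetical and exists only to exercise each boundary; assigning a salary of exactly $75,000 or $125,000 to the higher bin is a choice made for this sketch and may differ from how the recode used in this report treats those values.

```r
# Hypothetical toy salaries, chosen only to exercise each bin edge.
toy_salary <- c(6012, 74999, 75000, 124999, 125000, 456216)

# right = FALSE makes each bin closed on the left: [lower, upper)
salcat <- cut(
  toy_salary,
  breaks = c(-Inf, 75000, 125000, Inf),
  labels = c("<75000", "75000 - 125000", ">=125000"),
  right  = FALSE
)

counts <- table(salcat)
round(100 * prop.table(counts), 1)  # percentage in each category
```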

For Part 2 of HW3, completed in Excel on the fluffy kittens, the output is as follows: the percentage of fluffy kittens that weigh between 2.8 and 4.8 pounds is 97.59% of the distribution. For kittens weighing 2.8 pounds the z-score equals -2, and for 4.8 pounds it equals 3. On the z-table, -2 corresponds to .0228 and 3 to .9987. The difference of these two values is .9759, or 97.59%.

For kittens weighing 2 pounds the z-score is -4, and for kittens weighing 5 pounds it is 3.5. The z-score of -4 corresponds to about .00003 (effectively 0 on a z-table) and the z-score of 3.5 to .9998. The difference of these two numbers is about .9997, or 99.97%. This shows that nearly all kittens weigh between 2 and 5 pounds, a slightly wider slice of the kitten population than those that fall between 2.8 and 4.8 pounds.
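The two interval calculations above can be checked in R with pnorm(), which returns the cumulative probability directly instead of reading it from a z-table. This is a minimal sketch that assumes the weights are normally distributed with the mean (3.6) and standard deviation (.4) used in this report; prob_between is a hypothetical helper name introduced only for this example.

```r
mu    <- 3.6  # mean kitten weight in pounds (from the report)
sigma <- 0.4  # standard deviation in pounds (from the report)

# Hypothetical helper: P(lower < X < upper) for X ~ Normal(mean, sd)
prob_between <- function(lower, upper, mean, sd) {
  pnorm(upper, mean = mean, sd = sd) - pnorm(lower, mean = mean, sd = sd)
}

# z-scores for 2.8, 4.8, 2 and 5 pounds
z_scores <- (c(2.8, 4.8, 2, 5) - mu) / sigma

p_narrow <- prob_between(2.8, 4.8, mu, sigma)  # between 2.8 and 4.8 pounds
p_wide   <- prob_between(2.0, 5.0, mu, sigma)  # between 2 and 5 pounds
round(100 * c(p_narrow, p_wide), 2)            # percentages
```

Because pnorm() does not round the way a printed table does, its results can differ from the table-based figures in the last decimal place.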

To calculate the 99th-percentile value of the kittens' weight, we found the z-score whose cumulative probability is closest to .99. The z-score for .01 is -2.33, and because the normal distribution is symmetric, the z-score for .99 is the same value with the sign flipped, 2.33. This is first multiplied by the standard deviation (.4), giving .932, which is then added to the mean of 3.6 for a total of 4.532. This means the top percentile of kitten weights begins at about 4.5 pounds. Since 4.8 pounds lies above this cutoff, kittens at 4.8 pounds can be seen as outliers from the mean, even more so kittens at 5 pounds. (Output of data and graph submitted as photo)
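The percentile cutoff can also be sketched with qnorm(), the inverse of the cumulative normal distribution, which avoids looking up the z-score by hand. The sketch assumes the same normal model for kitten weights (mean 3.6, standard deviation .4); the variable names are illustrative.

```r
mu    <- 3.6  # mean kitten weight in pounds (from the report)
sigma <- 0.4  # standard deviation in pounds (from the report)

z_99   <- qnorm(0.99)         # about 2.326; z-tables round this to 2.33
cutoff <- mu + z_99 * sigma   # weight at the 99th percentile
round(cutoff, 2)

# Share of kittens expected to weigh more than 4.8 and more than 5 pounds
pnorm(c(4.8, 5), mean = mu, sd = sigma, lower.tail = FALSE)
```

The cutoff comes out slightly below the hand-calculated 4.532 because qnorm() uses the unrounded z-score.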

library(readr, quietly = TRUE)
salaries = read_csv(file = "C:\\Users\\BTP\\Desktop\\Sam Houston Salaries (1).csv")
names(salaries) = tolower(names(salaries))

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(scales)
ggplot(data = salaries, mapping =  aes(annual_salary))+
geom_histogram(binwidth=10000)+
ggtitle(label="Distribution of Annual Salaries")+
xlab(label="Annual Salaries")+ 
ylab(label = "Total Number of Faculty")+
scale_x_continuous(labels=label_dollar())

library(dplyr)

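# Bin annual salaries into three categories; 6012 and 456216 are the minimum and maximum in the data
salaries$salcat <- car::Recode(salaries$annual_salary, recodes = "6012:75000 ='<75000'; 75000:125000 ='75000 - 125000'; 125000:456216  ='>125000'; else=NA", as.factor = TRUE)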

summary(salaries)

##  position_title     home_organization_desc annual_salary   
##  Length:2225        Length:2225            Min.   :  6012  
##  Class :character   Class :character       1st Qu.: 40950  
##  Mode  :character   Mode  :character       Median : 55485  
##                                            Mean   : 63822  
##                                            3rd Qu.: 76770  
##                                            Max.   :456216  
##             salcat    
##  <75000        :1635  
##  >125000       : 117  
##  75000 - 125000: 473  
##                       
##                       
##

salaries %>% 
    group_by(salcat) %>% 
    summarise(n = n()) %>%
    mutate(percentage = n / sum(n)*100)

## # A tibble: 3 x 3
##   salcat             n percentage
##   <fct>          <int>      <dbl>
## 1 <75000          1635      73.5 
## 2 >125000          117       5.26
## 3 75000 - 125000   473      21.3

library(ggplot2)
library(scales)
ggplot(data = salaries, mapping = aes(annual_salary)) +
  geom_histogram(binwidth = 10000) +
  ggtitle(label = "Distribution of Salaries") +
  xlab(label = "Salaries") +
  scale_x_continuous(labels = label_dollar())

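# Simulate 2225 normally distributed salaries using the mean and standard deviation given here
y <- rnorm(n = 2225, mean = 63821.58, sd = 36170.47)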

library(ggplot2)
ggplot(mapping = aes(y))+
  geom_histogram()+
  ggtitle(label="Distribution of Simulated Salaries")+
  xlab(label="Salaries")+
scale_x_continuous(labels=label_dollar())

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(y)

options(scipen=100)
qqnorm(salaries$annual_salary)