Homework #2

#Question 4.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.

New baseball statistics are emerging as high-speed cameras track player and ball movement. Unlike earlier statistics that measured results (e.g., a hit or an out), many of these new statistics measure the quality of batted-ball contact. The belief is that, over time, the most skilled players hit the ball hard, in the air, and on the barrel of the bat. Thus, by focusing on skills rather than results, you can identify the better baseball players over time. Looking for emerging players whose skills (the five below) cluster with those of established stars would give talent evaluators a leg up.

Five new statistics to use as predictors:

  1. Exit Velocity
  2. Barrel Rate
  3. Average Distance Ball Hit
  4. Hard Hit Rate
  5. Fly Ball/Ground Ball Ratio

#Question 4.2

Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.

After removing the species from the data, I used ggplot to get a sense of the relationships of each predictor and the possible number of clusters. The petal width and petal length plot looked clearly like three clusters.

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point()

ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) + geom_point()

ggplot(iris, aes(Sepal.Length, Petal.Width, color = Species)) + geom_point()

ggplot(iris, aes(Sepal.Width, Petal.Length, color = Species)) + geom_point()

ggplot(iris, aes(Sepal.Width, Petal.Width, color = Species)) + geom_point()

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

I then scaled the data.

iris_scale<-scale(iris_data)
head(iris_scale)

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667
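
As a quick check on what `scale` does, the sketch below recomputes the z-scores by hand and compares them with the function's output. It assumes `iris_data` is simply the four numeric columns of the built-in `iris` data set, since the code that created it is not shown above.

```r
# Assumption: iris_data holds the four numeric columns of the built-in iris data
iris_data <- iris[, 1:4]

# scale() centers each column on its mean and divides by its standard deviation
iris_scale <- scale(iris_data)

# The same transformation written out by hand
manual_scale <- sapply(iris_data, function(x) (x - mean(x)) / sd(x))

# Largest absolute difference between the two versions
max(abs(iris_scale - manual_scale))
```

A difference that is zero up to floating-point error would confirm that every column now has mean 0 and standard deviation 1, so no single measurement dominates the distance calculation in kmeans.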

After plotting, I created two data sets consisting of scaled data and unscaled data. I then set out to determine the optimal number of centers to use in my kmeans call. The raw data lists three species and the ggplot output above suggests three clusters. I also ran an elbow test on the scaled and unscaled data. Both plots confirm that three centers is a sound choice.

Here is the elbow plot of the unscaled data.

Here is the elbow plot of the scaled data.
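
The code behind the elbow plots is not shown, so the following is only a sketch of one common way to build one: run kmeans for a range of k and plot the total within-cluster sum of squares. The seed, the range of k, and `nstart = 25` are illustrative assumptions, not the settings used for the plots above.

```r
# Assumption: the scaled input is the four numeric iris columns
iris_scale <- scale(iris[, 1:4])

set.seed(42)  # k-means starts from random centers, so fix the seed
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(iris_scale, centers = k, nstart = 25)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot (scaled data)")
```

The "elbow" is the value of k after which adding another center yields only a small further drop in the curve.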

I then ran kmeans on both the unscaled and scaled data, using 3 centers and, just in case, 4 centers. All my tables are attached in my notes. In summary, the easiest species to classify, regardless of the model, was setosa. Every model that used all the predictors clustered the setosa species perfectly. While it is debatable whether 3 or 4 centers is better, I think using only petal width and petal length, as suggested by the plots above, is hands down the best choice of predictors. The model that stood out was the k=3 model on petal width and length. Both the scaled and unscaled data gave strong results. For example, the scaled k=3 model correctly grouped 50 setosa, 48 versicolor, and 46 virginica, or 144 of 150 flowers (96%).

##    type_species
##     setosa versicolor virginica
##   1      0         48         4
##   2      0          2        46
##   3     50          0         0
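
A table like the one above can be produced along the following lines. This is a minimal sketch, assuming the model was fit on the two scaled petal columns; the seed and `nstart` value are hypothetical, and the cluster numbers (1, 2, 3) are arbitrary labels that can come out in a different order from run to run.

```r
# Assumption: only the two petal measurements are clustered, after scaling
petal_scale <- scale(iris[, c("Petal.Length", "Petal.Width")])

set.seed(42)  # hypothetical seed; cluster numbering depends on it
fit <- kmeans(petal_scale, centers = 3, nstart = 25)

# Rows are cluster labels, columns are the true species
confusion <- table(cluster = fit$cluster, type_species = iris$Species)
confusion

# Share of flowers that fall in the majority species of their cluster
purity <- sum(apply(confusion, 1, max)) / sum(confusion)
purity
```

Because kmeans never sees the species labels, the purity figure is a measure of how well the clusters happen to line up with flower type, not a prediction accuracy in the supervised sense.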

#Question 5.1

Test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.

Steps to find outliers

  1. Summarize data
  2. Graph data and get sense of possible outliers
  3. Assume normal distribution
  4. Grubbs Test

With a max of 1993 compared to its mean and median, I think there may be a skew toward larger numbers.

summary(us_crime$Crime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   342.0   658.5   831.0   905.1  1057.5  1993.0

A simple histogram of the data suggests that there are some potential outliers among the larger values.

hist(us_crime$Crime,main="Histogram with Default Parameters")

A box and whisker plot depicts outliers as well.

boxplot.default(us_crime$Crime)
outvals=boxplot(us_crime$Crime)$out

outvals

## [1] 1969 1674 1993

I created a normal probability plot that shows that the data besides outliers looks fairly normal.

nprob=qqnorm(us_crime$Crime)
qqline(us_crime$Crime)

I ran three different Grubbs tests. Surprisingly, with all p-values > 0.05, the Grubbs tests indicate there are no outliers.

grubbs.test(us_crime$Crime, type=10)

## 
##  Grubbs test for one outlier
## 
## data:  us_crime$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier

grubbs.test(us_crime$Crime, type=10, opposite = TRUE)

## 
##  Grubbs test for one outlier
## 
## data:  us_crime$Crime
## G = 1.45589, U = 0.95292, p-value = 1
## alternative hypothesis: lowest value 342 is an outlier

grubbs.test(us_crime$Crime, type=11)

## 
##  Grubbs test for two opposite outliers
## 
## data:  us_crime$Crime
## G = 4.26877, U = 0.78103, p-value = 1
## alternative hypothesis: 342 and 1993 are outliers
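
To see what the one-sided test compares, the sketch below computes the Grubbs statistic G = (max - mean) / sd and the standard critical value based on the t distribution, using base R only. The function names and the sample vector are hypothetical and for illustration; this is not the internal code of the outliers package.

```r
# Critical value for the one-sided Grubbs statistic G = (max - mean) / sd.
# n is the sample size and alpha the significance level; both are inputs
# chosen by the analyst, not values taken from the homework.
grubbs_critical <- function(n, alpha = 0.05) {
  t_crit <- qt(1 - alpha / n, df = n - 2)
  ((n - 1) / sqrt(n)) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
}

# Grubbs statistic for the largest value of a numeric vector
grubbs_stat_max <- function(x) (max(x) - mean(x)) / sd(x)

# Hypothetical data, for illustration only
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 30)
g <- grubbs_stat_max(x)
g > grubbs_critical(length(x))  # TRUE means the maximum is flagged
```

The critical value grows with the sample size, so a maximum that looks extreme on a boxplot can still fall short of it, which is consistent with the p-value of 0.07887 reported above for 1993.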

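This didn’t feel right to me, so I ran the fence test. This test computes the interquartile range (IQR = Q3 - Q1), then subtracts 1.5 × IQR from the first quartile and adds 1.5 × IQR to the third quartile to create limits. Anything below the lower limit or above the upper limit may be an outlier.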

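# Interquartile range of the crime data (Q3 - Q1)
crime.df.IQR <- IQR(us_crime$Crime)
# Lower inner fence: Q1 - 1.5 * IQR
LowerInnerFence <- 658.5 - 1.5 * crime.df.IQR
# Upper inner fence: Q3 + 1.5 * IQR
UpperInnerFence <- 1057.5 + 1.5 * crime.df.IQR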

The fence test suggests that 1969, 1674, and 1993 are outliers.

# Values above the upper tukey fence
(us_crime$Crime[which(us_crime$Crime > UpperInnerFence)])

## [1] 1969 1674 1993

# Values below the lower tukey fence
us_crime$Crime[which(us_crime$Crime < LowerInnerFence)]

## integer(0)

Given that the Grubbs test conflicts with the other outlier checks, the right next step is to investigate the box-and-whisker plot and the fence results to better understand these data points. Are they data entry errors or valid data points?

#Question 6.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?

Sticking with baseball, a change detection model would be useful for identifying players who are regressing and are no longer worthy of top salaries. If you plan to pay a player based on previous-season stats, the CUSUM statistic S_t should remain below the threshold T. The bigger the star and the bigger the contract, the more the choice of T should reflect the money and commitment needed to sign the player. A journeyman signing a one-year deal can have larger C and T values. However, a longer-term deal for a large sum should use lower C and T values, because signing a regressing player to an expensive long-term deal can handicap a team for years.

#Question 6.2 (1)

Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year.

I created a change detection model with upper and lower bounds. Since we are only interested in temperatures dropping, we can focus on the lower bound. I averaged each day’s temperature across every year in the data set. The result was 88.7. My first analysis used typical C and T values: C was one half of the July standard deviation (0.4426) and T was 5 times the July standard deviation (4.4262). This model indicated that summer ended on August 30th, which corresponds to traditional notions of the end of summer (e.g., back to school and Labor Day). I have attached my Excel spreadsheet and my graphs. Naturally, changing C and T will change when the model indicates the end of summer.
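
The analysis itself was done in Excel, so the following is only a sketch of how the same one-sided CUSUM for a decrease could be written in R. The function name, argument names, and temperature vector are hypothetical; `c_slack` plays the role of C and `t_threshold` the role of T.

```r
# One-sided CUSUM for detecting a decrease:
#   S_t = max(0, S_{t-1} + (mu - x_t - C)), flag a change when S_t >= T
cusum_decrease <- function(x, mu, c_slack, t_threshold) {
  s <- numeric(length(x))
  prev <- 0
  for (t in seq_along(x)) {
    s[t] <- max(0, prev + (mu - x[t] - c_slack))
    prev <- s[t]
  }
  # Index of the first observation at which the statistic reaches the threshold
  list(s = s, first_change = which(s >= t_threshold)[1])
}

# Hypothetical daily highs: ten warm days followed by five cooler ones
temps <- c(rep(89, 10), rep(80, 5))
result <- cusum_decrease(temps, mu = 89, c_slack = 1, t_threshold = 10)
result$first_change
```

A larger `c_slack` makes the statistic ignore small day-to-day dips, and a larger `t_threshold` delays the detection, which is why the end-of-summer date moves when C and T change.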

#Question 6.2 (2)

Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).

Using the analysis above, I defined summer as July 1 through August 30th. I then averaged each year’s temperatures over those dates. That is, in the first analysis the mean was calculated across rows, while in this analysis it was calculated down columns. The model does not come close to piercing the lower band. Thus, the summer has not become warmer. Additionally, I graphed the min, max, and standard deviation of temperatures per year from July 1 to August 30th. This data gives no indication that Atlanta’s summers are warmer now.

I do not know how to include Excel charts and stats in R Markdown. Please see the attachments for more data and answers. Thank you.

Lawrence Pereira

5/26/2020