Homework 1

##Kevin Kuipers (Completed by Myself)

##01/15/2019

Please do the following problems from the textbook ISLR.

##1. Question 2.4.2 pg 52

#Problem 2a

The CEO salary data set is a regression problem because the response, CEO salary, is continuous. Two of the predictors, number of employees and profit, are also continuous, while the third, industry, is categorical. In this case we are looking at which variables contribute to the CEO salary, so this is an inference problem. The data were collected on the top 500 firms and there are 3 predictors for CEO salary, so n=500 and p=3.

#Problem 2b

This data set concerns a product that is about to be launched, and the goal is to know whether it will be a success or a failure. Because the response has two categories, this is a classification problem. Because the interest is in a future outcome, the goal is prediction. The data were collected on 20 similar products, and the predictors recorded are price charged for the product, marketing budget, competition price, and 10 others. Therefore, n=20 and p=13.

#Problem 2c

This is a prediction problem, as the goal is to predict the percent change in the USD/Euro exchange rate from the weekly changes in the world stock markets. The data are quantitative: the response is the percent change in the USD/Euro exchange rate, and the predictors are the percent changes in the US, British, and German markets. Therefore, this is a regression problem. Since weekly data were recorded for all of 2012, the sample size is n=52, and the number of predictors is p=3 (the markets of Britain, Germany, and the US).

##2. Question 2.4.4 pg 53

#Problem 4a Real-life Classification Problems

Will a patient recover from cancer while undergoing chemotherapy? The data set would contain age, gender, type of cancer, life satisfaction score from 1 to 10, BMI, exercise (yes or no), and the outcome of recovering from the cancer or not. This would be used to predict whether a given patient will recover.
Detecting spam email. The features would be the words associated with spam and the words not associated with spam. The outcome variable would label each email as SPAM or Non-SPAM based on its text.
Determining the type of an iris flower. The 4 covariates would be sepal length and width and petal length and width. The outcome variable would be the type of iris. The petal length and width would be used to determine what kind of iris it is.

#Problem 4b Real-life Regression Problems

Linear regression to see whether attendance at a church service is related to the money received in the offering. The data would be recorded weekly for an entire year to see whether weekly attendance and the money received in the offering are related. If so, we would want to predict how much money would be received based on attendance, for budgeting purposes. This would be used for prediction.
Predicting a student's academic outcome. The features would be socio-economic status (low, medium, high), IQ, standardized test scores, single parent (yes or no), school's funding level, and private vs public vs charter school, and the outcome variable would be GPA. This would be a prediction, based on these variables, of how the student will perform in college.
Predicting house values using number of rooms, square footage of the house, and garage size in number of cars, with the house value as the outcome. This would be used to assess whether a given house is priced correctly. This would be an inference.

#Problem 4c Real-life cluster analysis

Amazon advertising - finding groups of customers and advertising to them based on region, gender, age, and previous purchases.
Grocery store location - finding the best location to open a grocery store based on area needs, proximity to other food places, and highly populated areas.
Video game rating - clustering video games by rating (E, T, M) based on violence, language, and suggestive content.

##3. Question 2.4.6 pg 53

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

A parametric approach begins with an assumption about the functional form of f. An easy example of a parametric approach is a linear regression model. It has a fixed set of parameters, which reduces the problem of estimating the function to estimating those parameters. With parametric methods we often talk about flexibility and about linear vs non-linear functional forms when fitting the data.

A non-parametric approach does not make explicit assumptions about the functional form of f. We often think of these as the machine learning models such as K-nearest neighbors, support vector machines, and decision trees. A non-parametric method looks for patterns in the data and estimates the function so that it stays close to the observed output values.

Both approaches have advantages and disadvantages. The advantages of the parametric approach are that it is a rather simple method to understand and its results are easy to explain, it is fast to fit, and it does not always require large amounts of data to train and test. A disadvantage is that it can produce an inferior model when the assumed form does not match the true function. Other disadvantages are that parametric methods are geared toward less complex problems and are constrained to a specific functional form.

The non-parametric approach allows more flexibility, as it permits fitting many functional forms to the data. In terms of model performance, it can produce better predictions. Finally, it does not have to stick to conventional assumptions about the data and the function. However, the non-parametric approach also comes with disadvantages. It can overfit the training data, and it does not always allow an easy explanation of how it arrived at its predictions. It requires a lot more training data than parametric methods. Lastly, fitting the model is often much slower because there are more parameters to estimate.
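
The trade-off described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated from a sine curve (not taken from the book), `knn_predict()` and `mse()` are hypothetical helpers written here for illustration, and k = 10 is an arbitrary choice. On data like this, where the true function is not linear, the flexible method would be expected to have the lower test error.

```r
# Simulated data: a nonlinear truth (sine) plus noise. All values are illustrative.
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, 0, 2 * pi))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

# Hold out the last 50 observations for testing
train <- 1:150
test  <- 151:200

# Parametric: assume f is linear and estimate two coefficients
fit_lm  <- lm(y ~ x, data = dat[train, ])
pred_lm <- predict(fit_lm, newdata = dat[test, ])

# Non-parametric: K-nearest neighbors regression written by hand (hypothetical helper).
# Each prediction is the mean response of the k closest training points.
knn_predict <- function(x_train, y_train, x_new, k = 10) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[1:k]
    mean(y_train[nearest])
  })
}
pred_knn <- knn_predict(dat$x[train], dat$y[train], dat$x[test], k = 10)

# Compare test mean squared error of the two approaches
mse <- function(actual, predicted) mean((actual - predicted)^2)
mse_lm  <- mse(dat$y[test], pred_lm)
mse_knn <- mse(dat$y[test], pred_knn)
c(linear = mse_lm, knn = mse_knn)
```

If the simulated truth were changed to a straight line, the linear model would be the better choice, which is the other side of the same trade-off.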

##4. Question 2.4.8 pg 54-55

#Problem 8a

I will read the College.csv data set into R using the read.csv() command. The data set is also available in the ISLR library.

I could not load it from the ISLR library, so I downloaded the .csv file and loaded it from my directory.

#Accessing library
library(ISLR)
#obtaining data set
#data(College)
#Reading data set
#college <- read.csv("College.csv")

#Loaded from a directory on my computer

college <- read.csv("C:/Users/Agent000/Documents/DSU/Classes/2019/01 Spring/STAT 602 - Modern Applied Statistics II/Week 1/College.csv")

#Problem 8b

I will use the fix() command on the data to get an overview of how the data looks and is arranged. It is a handy command that opens the data in a spreadsheet-like table for viewing.

fix(college)
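
fix() opens an interactive data editor, which may not be available when a script is run non-interactively. A minimal sketch of alternatives that print to the console, using a small made-up data frame (the rows are hypothetical and only the column names mirror the data set):

```r
# Made-up rows for illustration only
toy <- data.frame(
  Private = c("Yes", "No", "Yes"),
  Apps    = c(1000, 2500, 400),
  Accept  = c(800, 2000, 350)
)

dim(toy)       # number of rows and columns
str(toy)       # column types and the first few values
head(toy, 2)   # first two rows
```

The same three calls can be applied to college once it is loaded.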

Since the first column does not have a header and each row represents a college, the code below stores the college names as the row names of the data frame, which fix() displays in a column labeled row.names. The code comes directly from the book.

rownames (college) <- college[,1]
fix(college)

As you can see from the code above, the college names now appear twice: once in the row.names column and once in the first data column. The book next removes the column containing the college names so that the first column is Private. The code in the book did not produce this for me, so I modify it below: I drop the first column and then set the row names to NULL, which replaces the college names with the default row numbers. The first column is then Private.

college <- college[,-1]
row.names(college) <- NULL
fix(college)
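
The row-name steps above can also be done at read time with the row.names argument of read.csv(). A minimal sketch, assuming a file laid out like College.csv (first column without a header, holding the names); the temporary file and its three rows are made up for illustration:

```r
# Toy stand-in for College.csv: the first column has no header and holds names.
# The three rows below are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ',"Private","Apps","Accept"',
  '"College A","Yes",1000,800',
  '"College B","No",2500,2000',
  '"College C","Yes",400,350'
), csv_path)

# row.names = 1 uses the first column as row names and removes it from the data
toy <- read.csv(csv_path, row.names = 1)

rownames(toy)   # the college names
names(toy)      # Private is now the first data column
```

Unlike setting the row names to NULL, this keeps the college names attached to each row.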

#Problem 8c i)

Now I will use the summary() command to produce a numerical summary of the variables.

summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
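
The No/Yes counts shown for Private depend on that column being a factor. Since R 4.0.0, read.csv() leaves text columns as character by default, in which case summary() reports only length and class. A minimal sketch of the difference on a made-up column (the values are hypothetical):

```r
# Made-up data; Private is stored as character here
toy <- data.frame(
  Private = c("Yes", "No", "Yes", "Yes"),
  Apps    = c(1000, 2500, 400, 1200),
  stringsAsFactors = FALSE
)

summary(toy$Private)   # character column: length, class and mode only

# Converting to a factor restores the per-level counts
toy$Private <- factor(toy$Private)
summary(toy$Private)
```

Passing stringsAsFactors = TRUE to read.csv() is another way to get factors directly.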

#Problem 8c ii)

I will use the pairs() function to produce a scatterplot matrix of the first 10 columns of the data. I will also do this using the ggpairs() command from the GGally library.

pairs(college[,1:10])
#install.packages("GGally")
library(GGally)

ggpairs(data=college[,1:10])

The distributions of the various variables are rather interesting: Apps, Accept, Enroll, Top10perc, F.Undergrad, and P.Undergrad all seem to be positively skewed. Top25perc, Outstate, and Room.Board are roughly normally distributed. Some of the variables have strong positive relationships with each other. For example, F.Undergrad has a strong positive correlation with Apps, Accept, and Enroll. In addition, Enroll has a strong positive relationship with Apps and Accept.

#Problem 8 c iii)

I will use the plot() command to produce side-by-side boxplots of Outstate vs. Private. This will also be done using the ggplot() command from the tidyverse library.

#install.packages("tidyverse")
#Loading the library for ggplot
library(tidyverse)

#using the Base R plot to create a boxplot of Private College and Out of State Tuition
plot(college$Outstate ~ college$Private, xlab="Private College", ylab="Out of State Tuition", main="Boxplot of Private College vs Out of State Tuition")

#using ggplot to create a boxplot of Private College and Out of State Tuition
ggplot(data=college, aes(Private, Outstate)) + geom_boxplot() + labs(x="Private College", y="Out Of State Tuition", title="Boxplot of Private College vs Out of State Tuition")

It appears that private colleges have higher out-of-state tuition than non-private colleges. However, there are some outliers among the non-private colleges.
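
The points drawn as outliers in a boxplot can also be listed numerically. A minimal sketch using boxplot.stats() from base R, on made-up tuition values (these numbers are hypothetical and are not taken from College.csv):

```r
# Made-up tuition values for two groups
toy <- data.frame(
  Private  = factor(c(rep("No", 8), rep("Yes", 8))),
  Outstate = c(5000, 5500, 6000, 6200, 6500, 7000, 7200, 15000,
               9000, 10000, 11000, 11500, 12000, 12500, 13000, 14000)
)

# boxplot.stats()$out holds the points beyond 1.5 times the hinge spread
outliers_by_group <- lapply(
  split(toy$Outstate, toy$Private),
  function(v) boxplot.stats(v)$out
)
outliers_by_group

# Group medians, to compare the centers numerically
tapply(toy$Outstate, toy$Private, median)
```

With the real data, replacing toy with college would list the outlying tuition values for each group.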

#Problem 8 c iv)

I will create a new variable called Elite by binning the Top10perc variable. Elite is a categorical variable that indicates whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Now I will use the summary command to see how many elite colleges there are.

summary(college$Elite)

##  No Yes 
## 699  78

So there are 78 Elite colleges based on our criteria and 699 that are not.
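
The same binning can be written in one step with ifelse(). A minimal sketch on a made-up vector (the values are hypothetical; the cutoff of 50 is the one used above):

```r
# Made-up Top10perc values
top10 <- c(23, 51, 50, 96, 15, 62)

# ifelse() builds the labels in one step; levels fixes the order as No, Yes
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)          # counts per level
mean(elite == "Yes")  # share of colleges flagged as elite
```

Note that a value of exactly 50 is labeled "No", because the condition is strictly greater than 50.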

Now I will use the plot() and ggplot() commands to produce boxplots of out-of-state tuition vs. Elite.

plot(college$Outstate ~ college$Elite, xlab="Elite College", ylab="Out Of State Tuition", main="Boxplot of Elite Colleges vs Out Of State Tuition")

ggplot(college, aes(Elite, Outstate)) + geom_boxplot() + labs(x="Elite College", y="Out Of State Tuition", title="Boxplot of Elite Colleges vs Out of State Tuition")

It appears that elite colleges have higher out-of-state tuition than non-elite colleges. However, the non-elite colleges have a couple of outliers.

#Problem 8 c v)

I will create histograms of the Outstate variable using the hist() function. I will create 4 histograms of the same variable but with the number of bins set to 5, 10, 15, and 20. I will also do this using the ggplot() command combined with the gridExtra package.

par(mfrow=c(2,2))
hist(college$Outstate, breaks=5, main="Histogram of College Out of State Tuition, bin=5", xlab="Out Of State Tuition")
hist(college$Outstate, breaks=10, main="Histogram of College Out of State Tuition, bin=10", xlab="Out Of State Tuition")
hist(college$Outstate, breaks=15, main="Histogram of College Out of State Tuition, bin=15", xlab="Out Of State Tuition")
hist(college$Outstate, breaks=20, main="Histogram of College Out of State Tuition, bin=20", xlab="Out Of State Tuition")

#install.packages("gridExtra")
library(gridExtra)

h1 <- ggplot(college, aes(Outstate)) + geom_histogram(fill="cornsilk", colour="grey60", bins=5) + labs(title="Histogram of College Out of State Tuition, bin=5", x="Out of State Tuition")

h2 <- ggplot(college, aes(Outstate)) + geom_histogram(fill="cornsilk", colour="grey60",bins=10) + labs(title="Histogram of College Out of State Tuition, bin=10", x="Out of State Tuition")

h3 <- ggplot(college, aes(Outstate)) + geom_histogram(fill="cornsilk", colour="grey60",bins=15) + labs(title="Histogram of College Out of State Tuition, bin=15", x="Out of State Tuition")

h4 <- ggplot(college, aes(Outstate)) + geom_histogram(fill="cornsilk", colour="grey60",bins=20) + labs(title="Histogram of College Out of State Tuition, bin=20", x="Out of State Tuition")


grid.arrange(h1, h2, h3, h4, nrow=2)

Even as we increase the number of bins, Out of State Tuition appears to be roughly normal with a slight positive skew.
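
One detail worth knowing when comparing the panels: in hist(), a single number given to breaks is only a suggestion, so the number of bins actually drawn can differ from the number requested. A minimal sketch on simulated values (the gamma distribution and its parameters are assumptions chosen only to give a right-skewed, tuition-like variable):

```r
set.seed(2)
# Simulated right-skewed values standing in for a tuition-like variable
x <- rgamma(500, shape = 4, scale = 2500)

# A single number for breaks is treated as a suggestion
h_suggest <- hist(x, breaks = 5, plot = FALSE)
length(h_suggest$counts)   # number of bins actually used

# Explicit break points give exactly the requested number of bins
exact_breaks <- seq(floor(min(x)), ceiling(max(x)), length.out = 5 + 1)
h_exact <- hist(x, breaks = exact_breaks, plot = FALSE)
length(h_exact$counts)     # 5 by construction
```

Setting plot = FALSE returns the bin counts without drawing, which makes it easy to check how many bins each call produced.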

Now let's look at Top10perc, which is the percentage of new students from the top 10% of their H.S. class.

par(mfrow=c(2,2))
hist(college$Top10perc, breaks=5, main="Top 10% New Students from H.S. Class , bin=5", xlab="Top 10%")
hist(college$Top10perc, breaks=10, main="Top 10% New Students from H.S. Class , bin=10", xlab="Top 10%")
hist(college$Top10perc, breaks=15, main="Top 10% New Students from H.S. Class , bin=15", xlab="Top 10%")
hist(college$Top10perc, breaks=20, main="Top 10% New Students from H.S. Class , bin=20", xlab="Top 10%")

#install.packages("gridExtra")
library(gridExtra)

h5 <- ggplot(college, aes(Top10perc)) + geom_histogram(fill="cornsilk", colour="grey60", bins=5) + labs(title="Top 10% New Students from H.S. Class , bin=5", x="Top 10%")

h6 <- ggplot(college, aes(Top10perc)) + geom_histogram(fill="cornsilk", colour="grey60",bins=10) + labs(title="Top 10% New Students from H.S. Class , bin=10", x="Top 10%")

h7 <- ggplot(college, aes(Top10perc)) + geom_histogram(fill="cornsilk", colour="grey60",bins=15) + labs(title="Top 10% New Students from H.S. Class , bin=15", x="Top 10%")

h8 <- ggplot(college, aes(Top10perc)) + geom_histogram(fill="cornsilk", colour="grey60",bins=20) + labs(title="Top 10% New Students from H.S. Class , bin=20", x="Top 10%")


grid.arrange(h5, h6, h7, h8, nrow=2)

There appears to be a positive skew in the percentage of new students from the top 10% of their H.S. class, even as we increase the number of bins.
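
The visual impression of skew can be backed up with a number. A minimal sketch of the moment-based sample skewness; `skewness()` is a hypothetical helper defined here (base R has no built-in skewness function), and the data are simulated for illustration:

```r
# Moment-based sample skewness: mean cubed deviation divided by
# the 1.5 power of the mean squared deviation
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

set.seed(3)
right_skewed <- rexp(1000)    # long right tail, so skewness should be positive
symmetric    <- rnorm(1000)   # roughly symmetric, so skewness should be near 0

skewness(right_skewed)
skewness(symmetric)

# With the data loaded, the same check would be: skewness(college$Top10perc)
```

A clearly positive value would agree with the long right tail seen in the histograms.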

Now I will look at another aspect of the data. I will use corrplot to get a quick idea of the correlation coefficients between all the continuous variables in the data set.

library(corrplot)
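#Columns 2 to 18 are the continuous variables (Apps through Grad.Rate); column 1 is Private and column 19 is Elite
correlation <- cor(college[, 2:18])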
corrplot(correlation)

It appears that Apps, Accept, and Enroll are all strongly correlated among themselves. Top10perc and Top25perc are highly correlated with each other. Terminal and PhD are also highly correlated. F.Undergrad is strongly correlated with Apps, Accept, and Enroll. This plot makes it easy to identify which continuous variables are correlated. Blue means a positive relationship and red/orange means a negative relationship. All the highly correlated variables I mentioned have positive relationships among themselves.
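
Strongly correlated pairs can also be pulled out of the correlation matrix as a table instead of being read off the plot. A minimal sketch on simulated columns; the column names are borrowed from the data set but the values are made up, and the cutoff of 0.8 is an arbitrary assumption:

```r
set.seed(4)
n <- 300
# Simulated columns for illustration only
apps   <- rnorm(n, mean = 3000, sd = 1000)
accept <- 0.7 * apps + rnorm(n, sd = 150)   # built to track apps closely
books  <- rnorm(n, mean = 550, sd = 100)    # unrelated to the others
toy <- data.frame(Apps = apps, Accept = accept, Books = books)

cm <- cor(toy)

# Look only above the diagonal so that each pair is listed once
idx <- which(upper.tri(cm) & abs(cm) > 0.8, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
strong_pairs
```

Applied to the correlation matrix of the college data, the same steps would list the pairs discussed above together with their coefficients.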