Written Answers

Problem 1.

  • a.) This problem has a quantitative response (salary), which makes it a regression problem. We are interested in inference because the goal is to explain how the factors affect salary rather than to predict it for new observations. Thus, it is a supervised regression problem and we are most interested in inference. In this case, n=500 and p=3.
  • b.) In this problem we are launching a new product and want to classify it as a success or a failure, based on data for 20 similar products launched previously. We are trying to predict whether the new product will be a success or a failure, so this is a supervised classification problem and we are most interested in prediction. In this case, n=20 and p=13.
  • c.) In this problem we are working with a quantitative response, which makes it a regression problem. We are trying to predict the percent change in the dollar, so we are interested in prediction. Thus, it is a supervised regression problem and we are most interested in prediction. In this case, n=52 and p=3.

Problem 2.

  • a.)
    • i.) A bank either approves or rejects an applicant for a loan, based on the applicant's income, amount of debt, and credit score. We have data on 750 individuals, and we can perform a classification analysis to predict whether an applicant will be approved or not. The response here is approved/rejected and the predictors are income, amount of debt, and credit score. We are interested in prediction here. In this case n=750 and p=3.
    • ii.) A couple will decide whether or not to purchase a house based on twelve conditions: cost, proximity to work (number of miles from work), crime rate within the area, and nine other factors. We are interested in whether a potential buyer will purchase a home or not. We have a sample of 800 couples, and by analyzing them we can perform a classification analysis to predict whether a buyer will purchase a home. The response here is purchase/not purchase and the predictors are the twelve conditions listed above. We are interested in prediction here. In this case n=800 and p=12.
    • iii.) We are trying to predict whether a new shampoo will be a success or a failure, based on price, how many days it keeps hair clean, competition price, and six other variables. We have a sample of 45 other shampoos, and we can perform a classification analysis to predict whether the new shampoo will be a success. The response here is success/failure and the predictors are the nine conditions listed above. We are interested in prediction here. In this case n=45 and p=9.
  • b.)
    • i.) We are trying to predict the score of a fantasy football team. We collect data from 100 players and look at three factors: each player’s home team’s wins for the past season, the number of tackles made by the player, and the number of interceptions made by the player. To predict the fantasy team’s points, we need to build a regression model that regresses the team’s score on the three factors. The response is the score of the fantasy football team and the predictors are the three factors. In this case, n=100 and p=3, and the goal is prediction.
    • ii.) We are trying to predict what the GPA of a chosen student at Loyola will be for their senior year of college. The major factors that contribute to their GPA are class attendance rate, participation rate, the number of hours spent studying per day, the number of office hours visited, and GPA from the past three years. We have data on 2200 Loyola seniors. To predict the GPA of a senior at Loyola, we need to build a regression model that regresses the GPA of a chosen senior on the five factors. The response is the GPA of a chosen student at Loyola and the predictors are the five factors listed. In this case, n=2200 and p=5, and the goal is prediction.
    • iii.) Blood pressure in patients depends mostly on three factors: weight, physical activity, and diet. We want to build a regression model that regresses blood pressure on these factors so we can examine how strong the association between blood pressure and the three factors is. We have data on 500 patients. The response is the patient’s blood pressure and the predictors are the three factors mentioned above. The goal is inference. In this case n=500 and p=3. A brief illustrative fit for this setting is sketched just below this list.
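
The sketch below shows how the blood pressure example in iii.) could be fit in R. The data are simulated here purely for illustration; the variable names, units, and effect sizes are hypothetical, not taken from any real study.

set.seed(1)
n <- 500
weight   <- rnorm(n, mean = 170, sd = 25)   # pounds (hypothetical)
activity <- rnorm(n, mean = 4, sd = 2)      # hours of exercise per week (hypothetical)
diet     <- rnorm(n, mean = 50, sd = 10)    # arbitrary diet-quality score (hypothetical)
bp <- 90 + 0.25 * weight - 1.5 * activity - 0.3 * diet + rnorm(n, sd = 8)
patients <- data.frame(bp, weight, activity, diet)
# For inference, we would examine the fitted coefficients and their p-values.
fit <- lm(bp ~ weight + activity + diet, data = patients)
summary(fit)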

Problem 3.

  • a.)
college <- read.csv("C:/Users/Kajal/Downloads/College.csv")
  • b.) I did not need to perform this step as the book describes because my copy of the data did not include the university names as a first column.
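For reference, if the first column had contained the university names, the book’s approach would look roughly like the following (not run here, since my file lacks that column):

rownames(college) <- college[, 1]   # use the university names as row labels
college <- college[, -1]            # then drop the name column from the data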
  • c.i.)
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
  • c.ii.)
pairs(college[, 1:10])

  • c.iii.)
boxplot(college$Outstate ~ college$Private, col = c("purple", "pink"), main = "Outstate versus Private", 
        xlab = "Private", ylab = "Outstate")

  • c.iv.) There are 78 elite universities and 699 non-elite universities.
Elite <- rep("No", nrow(college))          # start every school as non-elite
Elite[college$Top10perc > 50] <- "Yes"     # elite: more than 50% of students from the top 10% of their HS class
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
fix(college)
summary(college$Elite)
##  No Yes 
## 699  78
boxplot(college$Outstate ~ college$Elite, col = c("red", "orange"), main = "Outstate versus Elite", 
        xlab = "Elite", ylab = "Outstate")

  • c.v.)

par(mfcol = c(2, 3))
# Each cost variable is plotted with two bin settings for comparison.
hist(college$Room.Board, breaks = 6, freq = TRUE, col = "pink", main = "Histogram Room and Board Costs", 
     xlab = "Room Board", ylab = "Frequency")
hist(college$Room.Board, breaks = 9, freq = TRUE, col = "purple", main = "Histogram Room and Board Costs", 
     xlab = "Room Board", ylab = "Frequency")
hist(college$Books, breaks = 6, freq = TRUE, col = "pink", main = "Histogram of Book Costs", 
     xlab = "Books", ylab = "Frequency")
hist(college$Books, breaks = 9, freq = TRUE, col = "purple", main = "Histogram of Book Costs", 
     xlab = "Books", ylab = "Frequency")
hist(college$Personal, breaks = 6, freq = TRUE, col = "pink", main = "Histogram of Personal Costs", 
     xlab = "Personal", ylab = "Frequency")
hist(college$Personal, breaks = 9, freq = TRUE, col = "purple", main = "Histogram of Personal Costs", 
     xlab = "Personal", ylab = "Frequency")

  • c.vi.) From this exercise it is clear that as the number of bins increases, the data become easier to follow because they are broken up more finely. Especially with the books, we see that five bins are not enough to understand how much students are paying for books on average.

From the histograms I’ve created, assuming Room.Board, Books, and Personal are costs, we see that on average students spend the most on room and board, and that variable is the most normally distributed. From the Top10perc and Top25perc histograms below, it’s clear that more students are in the top 25% of their high school class, and that distribution is approximately normal.

par(mfcol = c(2, 2))


hist(college$Top10perc, breaks = 9, freq = TRUE, col = "red", main = "Histogram Top 10%", 
     xlab = "Top 10 Percent", ylab = "Frequency")

hist(college$Top25perc, breaks = 9, freq = TRUE, col = "blue", main = "Histogram Top 25%", 
     xlab = "Top 25 Percent", ylab = "Frequency")

Problem 4.

    1. None of the training observations were incorrectly classified: every count in the confusion matrix below falls on the diagonal. This happens because k=1, so each training point is its own nearest neighbor. This is not a good model because it overfits the training data, which means it will not generalize well.
library(class)
training_set <- read.csv("C:/Users/Kajal/Downloads/PA_HW1_train.csv")
ts_new <- cbind(training_set$x1, training_set$x2)
# 1-nearest-neighbor classification of the training set against itself
a <- knn(ts_new, ts_new, cl = training_set$col, k = 1, prob = TRUE)
table(a, training_set$col)
##        
## a       green red
##   green    75   0
##   red       0 100
plot(training_set$x1, training_set$x2, col = as.character(a), pch = 16, main = "KNN Training Data k=1", xlab = "X1", ylab = "X2")

    2. 367 red observations were incorrectly classified as green and 352 green observations were incorrectly classified as red. This is not good: the accuracy is about 58.9%, as determined by the mean function below, which is only slightly better than random guessing at 50%.
test_set <- read.csv("C:/Users/Kajal/Downloads/PA_HW1_test.csv")
test_set_new <- test_set[, -3]   # drop the class column, keeping x1 and x2
# classify the test points using the training set, k = 1
b <- knn(ts_new, test_set_new, cl = training_set$col, k = 1, prob = TRUE)
table(b, test_set$col)
##        
## b       green red
##   green   398 367
##   red     352 633
plot(test_set$x1, test_set$x2, col = as.character(b), pch = 16, main = "KNN Test Data k=1", xlab = "X1", ylab = "X2")

mean(b==test_set$col)
## [1] 0.5891429
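
To see how sensitive the test accuracy is to the choice of k, a quick loop over a few values could be run. This is a sketch only; the k values are arbitrary and the resulting accuracies are not reported here.

# Test-set accuracy for several choices of k (illustrative only).
for (k in c(1, 3, 5, 10, 25)) {
  pred <- knn(ts_new, test_set_new, cl = training_set$col, k = k)
  cat("k =", k, "accuracy =", mean(pred == test_set$col), "\n")
}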

Problem 5.

library(MASS)
# color each observation by its species
color <- rep(NA, 150)
color[iris$Species == "setosa"] <- "green"
color[iris$Species == "versicolor"] <- "blue"
color[iris$Species == "virginica"] <- "red"
plot(iris$Sepal.Length, iris$Sepal.Width, col = color, pch = 16,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Irises based on Sepal Length and Sepal Width")

Problem 6.

The resulting predictions are shown in the confusion matrix below.

# LDA of species on the two sepal measurements only
iris_data <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
predicted <- predict(iris_data, iris)
table(predicted$class, iris$Species)
##             
##              setosa versicolor virginica
##   setosa         49          0         0
##   versicolor      1         36        15
##   virginica       0         14        35
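
From the confusion matrix, the overall accuracy works out to (49 + 36 + 35) / 150 = 80%. It can also be computed directly from the predictions:

# Overall accuracy of the LDA predictions on the full iris data.
mean(predicted$class == iris$Species)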