Question 1 (concept)[15p]

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).

  1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
  2. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
  3. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Answer 1

  1. Regression problem; we are most interested in inference. We want to understand how profit, number of employees, and industry relate to CEO salary. \(n = 500\), \(p = 3\).

  2. Classification problem; we are most interested in prediction. We want to classify the new product as a success or a failure using the price charged for the product, marketing budget, competition price, and ten other variables. \(n = 20\), \(p = 13\).

  3. Regression problem; we are most interested in prediction. We want to predict the % change in the USD/Euro exchange rate using the % change in the US market, the % change in the British market, and the % change in the German market. \(n = 52\), \(p = 3\).
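As a quick check of the counts in part 3, the sketch below builds a placeholder data frame with one row per week of 2012 and one column per recorded variable. The column names and the zero values are hypothetical stand-ins, not real market data; only the shape matters. Here \(n\) is the number of rows and \(p\) is the number of columns other than the response.

```r
# Placeholder layout for scenario 3: 52 weekly observations, 4 recorded variables.
# Column names and zero values are hypothetical; only the shape matters here.
weeks <- 52
fx <- data.frame(
  usd_eur = numeric(weeks),  # response: % change in USD/Euro
  us      = numeric(weeks),  # predictor: % change in the US market
  uk      = numeric(weeks),  # predictor: % change in the British market
  de      = numeric(weeks)   # predictor: % change in the German market
)

n <- nrow(fx)      # number of observations
p <- ncol(fx) - 1  # number of predictors (every column except the response)
cat("n =", n, ", p =", p, "\n")
```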

Question 2 (applied)[35p for part (c)]

This exercise relates to the College data set. It contains a number of variables for 777 different universities and colleges in the US.

Before reading the data into R, it can be viewed in Excel or a text editor.

  (a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data. The R commands getwd() and setwd() may be helpful.

  (b) Look at the data using the head() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

rownames(college) <- college[, 1]
View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college <- college[, -1]
View(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

  (c) i. Use the summary() function to produce a numerical summary of the variables in the data set.
      ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
      iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
      iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
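The binning rule above can also be sketched in a single step with ifelse(). The Top10perc values below are hypothetical, chosen only to show that the comparison is strict (a value of exactly 50 does not count as elite); they are not taken from the College data.

```r
# Hypothetical Top10perc values; not taken from the College data.
top10 <- c(12, 55, 50, 78, 30, 51)

# Same rule in one step: "Yes" only when the value is strictly greater than 50.
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

# table() reports how many observations fall in each group.
print(table(elite))
```

Fixing the levels explicitly keeps "No" first even if a sample happens to contain no elite schools.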

      v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
      vi. Continue exploring the data, and provide a brief summary of what you discover.

Answer 2

a

college <- read.csv("./College.csv")

b

rownames(college) <- college[, 1]
college <- college[, -1]
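As a side note, the two-step row-name assignment can probably be collapsed by passing row.names = 1 to read.csv(). The sketch below compares both routes on a tiny in-memory CSV; the column header Name and the two rows are made up for illustration and do not come from College.csv.

```r
# Tiny in-memory CSV standing in for College.csv; the rows are made up.
csv_text <- "Name,Private,Apps
College A,Yes,100
College B,No,250"

# Approach from the exercise: copy column 1 into the row names, then drop it.
college_demo <- read.csv(text = csv_text)
rownames(college_demo) <- college_demo[, 1]
college_demo <- college_demo[, -1]

# One-step alternative: let read.csv() use column 1 as the row names.
college_alt <- read.csv(text = csv_text, row.names = 1)

# Both routes should leave the same row names and the same data columns.
print(rownames(college_alt))
print(names(college_alt))
```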

c i

summary(college[, -1])
##       Apps           Accept          Enroll       Top10perc       Top25perc
##  Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00   Min.   :  9.0
##  1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0
##  Median : 1558   Median : 1110   Median : 434   Median :23.00   Median : 54.0
##  Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56   Mean   : 55.8
##  3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0
##  Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00   Max.   :100.0
##   F.Undergrad     P.Undergrad         Outstate       Room.Board
##  Min.   :  139   Min.   :    1.0   Min.   : 2340   Min.   :1780
##  1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597
##  Median : 1707   Median :  353.0   Median : 9990   Median :4200
##  Mean   : 3700   Mean   :  855.3   Mean   :10441   Mean   :4358
##  3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050
##  Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :8124
##      Books           Personal         PhD            Terminal
##  Min.   :  96.0   Min.   : 250   Min.   :  8.00   Min.   : 24.0
##  1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0
##  Median : 500.0   Median :1200   Median : 75.00   Median : 82.0
##  Mean   : 549.4   Mean   :1341   Mean   : 72.66   Mean   : 79.7
##  3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0
##  Max.   :2340.0   Max.   :6800   Max.   :103.00   Max.   :100.0
##    S.F.Ratio      perc.alumni        Expend        Grad.Rate
##  Min.   : 2.50   Min.   : 0.00   Min.   : 3186   Min.   : 10.00
##  1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00
##  Median :13.60   Median :21.00   Median : 8377   Median : 65.00
##  Mean   :14.09   Mean   :22.74   Mean   : 9660   Mean   : 65.46
##  3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00
##  Max.   :39.80   Max.   :64.00   Max.   :56233   Max.   :118.00

ii

pairs(college[, 2:11])

iii

boxplot(college$Outstate ~ college$Private)

iv

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(Elite)
##  No Yes
## 699  78
boxplot(college$Outstate ~ college$Elite)

There are 78 elite universities.

v

par(mfrow = c(2, 2))
hist(college$Personal, nclass = 5)
hist(college$Personal, nclass = 10)
hist(college$Personal, nclass = 15)
hist(college$Personal, nclass = 20)
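One thing worth keeping in mind when varying the bin count: hist() treats a single number for breaks (or nclass, its S-compatibility equivalent) as a suggestion and rounds the break points to pretty values, so the number of bins drawn can differ from the number requested. The sketch below checks this without drawing anything by using plot = FALSE. The data are simulated right-skewed values, not the Personal column, and the rate parameter is an arbitrary assumption.

```r
set.seed(1)                      # reproducible simulated values
x <- rexp(200, rate = 1 / 1300)  # hypothetical right-skewed data, not the Personal column

requested <- c(5, 10, 15, 20)

# plot = FALSE returns the histogram object without drawing it.
actual <- sapply(requested, function(k) {
  h <- hist(x, breaks = k, plot = FALSE)
  length(h$counts)               # number of bins actually used
})

# Compare the requested bin counts with the ones hist() chose.
print(data.frame(requested, actual))
```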

vi

I wish to investigate the relationship between the rejection rate, 1 - Accept/Apps, and Private.

reject_rate <- 1 - college$Accept / college$Apps

boxplot(reject_rate~college$Private)

In this study, we took a brief look at the College data. Out-of-state tuition is higher at private colleges than at public ones, and higher at elite colleges than at non-elite ones. Overall, estimated personal spending is positively skewed, and the rejection rates of private and public schools do not differ much.
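The boxplot comparison of rejection rates could also be backed by a per-group number, for example the median within each level of Private. The sketch below shows one way to do that with tapply(); the application and acceptance counts are hypothetical and are not taken from the College data.

```r
# Hypothetical application and acceptance counts; not from the College data.
demo <- data.frame(
  Private = c("Yes", "Yes", "No", "No"),
  Apps    = c(1000, 2000, 4000, 5000),
  Accept  = c(800, 1000, 3000, 4000)
)

# Rejection rate = share of applications that were not accepted.
demo$reject_rate <- 1 - demo$Accept / demo$Apps

# Median rejection rate within each level of Private.
med <- tapply(demo$reject_rate, demo$Private, median)
print(med)
```

On the real data, the same tapply() call with reject_rate and college$Private would give the two medians that the boxplot draws as centre lines.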