1. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are
    • Private : Public/private indicator
    • Apps : Number of applications received
    • Accept : Number of applicants accepted
    • Enroll : Number of new students enrolled
    • Top10perc : New students from top 10 % of high school class
    • Top25perc : New students from top 25 % of high school class
    • F.Undergrad : Number of full-time undergraduates
    • P.Undergrad : Number of part-time undergraduates
    • Outstate : Out-of-state tuition
    • Room.Board : Room and board costs
    • Books : Estimated book costs
    • Personal : Estimated personal spending
    • PhD : Percent of faculty with Ph.D.’s
    • Terminal : Percent of faculty with terminal degree
    • S.F.Ratio : Student/faculty ratio
    • perc.alumni : Percent of alumni who donate
    • Expend : Instructional expenditure per student
    • Grad.Rate : Graduation rate
    Before reading the data into R, it can be viewed in Excel or a text editor.
(a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
(b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.
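# (a) Read the data. stringsAsFactors = TRUE stores Private as a factor,
# which is no longer the default since R 4.0.
college = read.csv("College.csv", stringsAsFactors = TRUE)
# (b) Use the first column (the university names) as row names
rownames(college) = college[, 1]
fix(college)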

Try the following commands
> rownames(college) = college[, 1]
> fix(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college = college[ ,-1]
fix(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
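The two steps above (naming the rows, then dropping the first column) can also be done in a single call, because read.csv() accepts a row.names argument. A minimal sketch, assuming a tiny stand-in file written to a temporary path rather than the real College.csv; the two rows and the name college_demo are made up for illustration:

```r
# Write a tiny stand-in CSV (hypothetical values, not the real College.csv)
tmp = tempfile(fileext = ".csv")
writeLines(c(
  ",Private,Apps,Accept",
  "College A,Yes,1660,1232",
  "College B,No,2186,1924"
), tmp)

# row.names = 1 uses the first column as row names and drops it from the data;
# stringsAsFactors = TRUE stores Private as a factor
college_demo = read.csv(tmp, row.names = 1, stringsAsFactors = TRUE)

rownames(college_demo)   # the university names
names(college_demo)      # Private is now the first data column
```

With the real file, the same idea would be read.csv("College.csv", row.names = 1, stringsAsFactors = TRUE).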
(c)
i. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)
 Private        Apps           Accept          Enroll       Top10perc       Top25perc      F.Undergrad     P.Undergrad     
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
           Median : 1558   Median : 1110   Median : 434   Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
    Outstate       Room.Board       Books           Personal         PhD            Terminal       S.F.Ratio    
 Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00   Min.   : 24.0   Min.   : 2.50  
 1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50  
 Median : 9990   Median :4200   Median : 500.0   Median :1200   Median : 75.00   Median : 82.0   Median :13.60  
 Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66   Mean   : 79.7   Mean   :14.09  
 3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50  
 Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00   Max.   :100.0   Max.   :39.80  
  perc.alumni        Expend        Grad.Rate     
 Min.   : 0.00   Min.   : 3186   Min.   : 10.00  
 1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00  
 Median :21.00   Median : 8377   Median : 65.00  
 Mean   :22.74   Mean   : 9660   Mean   : 65.46  
 3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00  
 Max.   :64.00   Max.   :56233   Max.   :118.00  
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[ ,1:10])

iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
attach(college)
The following objects are masked from college (pos = 3):

    Accept, Apps, Books, Enroll, Expend, F.Undergrad, Grad.Rate, Outstate, P.Undergrad, perc.alumni, Personal,
    PhD, Private, Room.Board, S.F.Ratio, Terminal, Top10perc, Top25perc

plot(Private, Outstate)

iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite = rep("No", nrow(college))
Elite[college$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
# At first this made no sense to me.
# OK, I think I get it now. First, rep() fills the new variable Elite with "No" for every row of college.
# Then "Yes" is written in every row where Top10perc is greater than 50%.
# Finally, Elite is converted to a qualitative variable (a factor) and a new college data frame is created
# that includes the Elite column.
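
The same binning can also be written in one vectorised step. A minimal sketch, assuming made-up percentages in a vector named top10 (a stand-in for college$Top10perc) and a hypothetical result name elite_demo:

```r
# Toy values standing in for college$Top10perc (made up for illustration)
top10 = c(23, 16, 60, 89, 50, 51)

# ifelse() labels each element; factor() fixes the level order as No, Yes
elite_demo = factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite_demo)
```

Note that a value of exactly 50 is labelled "No", since the condition is a strict "greater than".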

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

summary(Elite)
 No Yes 
699  78 
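
The second half of this task, the boxplots of Outstate versus Elite, is not worked above. A sketch using the formula interface of boxplot(), assuming a small toy data frame named college_demo with made-up values in place of the real data:

```r
# Toy data standing in for the college data frame (values made up)
college_demo = data.frame(
  Outstate = c(7440, 12280, 11250, 19840, 6450, 18930),
  Elite    = factor(c("No", "No", "No", "Yes", "No", "Yes"))
)

# Formula interface: one box per level of Elite, no attach() needed
bp = boxplot(Outstate ~ Elite, data = college_demo,
             xlab = "Elite", ylab = "Out-of-state tuition")
```

With the real data, plot(college$Elite, college$Outstate) should give the same kind of figure, because plot() draws boxplots when its first argument is a factor.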
v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow=c(2,2))
hist(Apps)
hist(Enroll)
hist(Personal)
hist(PhD)
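
The histograms above all use the default number of bins. To vary it, as the task asks, hist() takes a breaks argument. A sketch, assuming simulated right-skewed values in a hypothetical vector apps_demo rather than the real Apps column:

```r
set.seed(1)
# Simulated right-skewed values standing in for a column such as Apps
apps_demo = rexp(500, rate = 1 / 3000)

par(mfrow = c(2, 2))           # 2 x 2 grid of plots
hist(apps_demo)                # default number of bins
hist(apps_demo, breaks = 10)   # breaks is a suggestion; R picks "pretty" cut points
hist(apps_demo, breaks = 30)
hist(apps_demo, breaks = 60)
par(mfrow = c(1, 1))           # restore the single-plot layout
```

Because a single number passed to breaks is only a suggestion, the bin count actually drawn may differ slightly from the value requested.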

vi. Continue exploring the data, and provide a brief summary of what you discover.

OK, I got the idea.

2. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
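# na.strings = "?" marks missing values; stringsAsFactors = TRUE stores name as a factor
Auto = read.table("Auto.data", header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
Auto = na.omit(Auto)   # drop rows that contain missing values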
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight      acceleration        year      
 Min.   : 9,00   Min.   :3,000   Min.   : 68,0   Min.   : 46,0   Min.   :1613   Min.   : 8,00   Min.   :70,00  
 1st Qu.:17,00   1st Qu.:4,000   1st Qu.:105,0   1st Qu.: 75,0   1st Qu.:2225   1st Qu.:13,78   1st Qu.:73,00  
 Median :22,75   Median :4,000   Median :151,0   Median : 93,5   Median :2804   Median :15,50   Median :76,00  
 Mean   :23,45   Mean   :5,472   Mean   :194,4   Mean   :104,5   Mean   :2978   Mean   :15,54   Mean   :75,98  
 3rd Qu.:29,00   3rd Qu.:8,000   3rd Qu.:275,8   3rd Qu.:126,0   3rd Qu.:3615   3rd Qu.:17,02   3rd Qu.:79,00  
 Max.   :46,60   Max.   :8,000   Max.   :455,0   Max.   :230,0   Max.   :5140   Max.   :24,80   Max.   :82,00  
                                                                                                               
     origin                      name    
 Min.   :1,000   amc matador       :  5  
 1st Qu.:1,000   ford pinto        :  5  
 Median :1,000   toyota corolla    :  5  
 Mean   :1,577   amc gremlin       :  4  
 3rd Qu.:2,000   amc hornet        :  4  
 Max.   :3,000   chevrolet chevette:  4  
                 (Other)           :365  
(a) Which of the predictors are quantitative, and which are qualitative?

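Of the nine columns, only name is stored as a qualitative variable. The other eight (mpg, cylinders, displacement, horsepower, weight, acceleration, year and origin) are stored as numbers, although origin is a numeric code for a category and is arguably qualitative as well.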

(b) What is the range of each quantitative predictor? You can answer this using the range() function.
attach(Auto)
The following objects are masked from Auto (pos = 3):

    acceleration, cylinders, displacement, horsepower, mpg, name, origin, weight, year

range(mpg)
[1]  9,0 46,6
range(cylinders)
[1] 3 8
range(acceleration)
[1]  8,0 24,8
range(displacement)
[1]  68 455
range(horsepower)
[1]  46 230
range(origin)
[1] 1 3
range(weight)
[1] 1613 5140
range(year)
[1] 70 82
(c) What is the mean and standard deviation of each quantitative predictor?
mean(mpg)
[1] 23,44592
mean(cylinders)
[1] 5,471939
mean(acceleration)
[1] 15,54133
mean(displacement)
[1] 194,412
mean(horsepower)
[1] 104,4694
mean(origin)
[1] 1,576531
mean(weight)
[1] 2977,584
mean(year)
[1] 75,97959
sd(mpg)
[1] 7,805007
sd(cylinders)
[1] 1,705783
sd(acceleration)
[1] 2,758864
sd(displacement)
[1] 104,644
sd(horsepower)
[1] 38,49116
sd(origin)
[1] 0,8055182
sd(weight)
[1] 849,4026
sd(year)
[1] 3,683737
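
The column-by-column calls above can be collapsed with sapply(), which applies one function to every column. A sketch, assuming a simulated toy data frame named auto_demo with a few Auto-style column names; the values are random, not the real Auto data:

```r
set.seed(1)
# Toy data frame with a few Auto-style columns (simulated values)
auto_demo = data.frame(
  mpg        = round(runif(100, 9, 46), 1),
  horsepower = round(runif(100, 46, 230)),
  weight     = round(runif(100, 1613, 5140)),
  name       = paste("car", 1:100)
)

# Keep only the numeric columns, then apply each statistic column by column
num_cols = auto_demo[, sapply(auto_demo, is.numeric)]
stats_demo = rbind(
  min  = sapply(num_cols, min),
  max  = sapply(num_cols, max),
  mean = sapply(num_cols, mean),
  sd   = sapply(num_cols, sd)
)
stats_demo
```

This also avoids attach(), so the columns used are always the ones in the data frame that is named explicitly.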
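(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
# Drop rows 10 through 85 inclusive; Auto is overwritten, so reload it before re-running this line
Auto = Auto[-(10:85), ]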
attach(Auto)
The following objects are masked from Auto (pos = 3):

    acceleration, cylinders, displacement, horsepower, mpg, name, origin, weight, year

mean(mpg)
[1] 27,9509
mean(cylinders)
[1] 5
mean(acceleration)
[1] 15,89222
mean(displacement)
[1] 164,2695
mean(horsepower)
[1] 93,52695
mean(origin)
[1] 1,712575
mean(weight)
[1] 2727,91
mean(year)
[1] 79,23952
sd(mpg)
[1] 7,657203
sd(cylinders)
[1] 1,509009
sd(acceleration)
[1] 2,767126
sd(displacement)
[1] 87,60023
sd(horsepower)
[1] 31,88776
sd(origin)
[1] 0,8788122
sd(weight)
[1] 651,5159
sd(year)
[1] 2,68465
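
A safer pattern is to store the subset under a new name, so that the full data stay available for the later parts and re-running the line cannot remove further rows. A sketch, assuming a simulated 100-row toy data frame auto_demo and a hypothetical subset name auto_sub; it also computes the range, which the task asks for:

```r
set.seed(1)
# Simulated stand-in for the numeric Auto columns (100 rows, random values)
auto_demo = data.frame(
  mpg    = round(runif(100, 9, 46), 1),
  weight = round(runif(100, 1613, 5140))
)

# Negative indices drop rows 10, 11, ..., 85 (76 rows); auto_demo is untouched
auto_sub = auto_demo[-(10:85), ]
nrow(auto_sub)

sapply(auto_sub, range)   # first row is the minimum, second row the maximum
sapply(auto_sub, mean)
sapply(auto_sub, sd)
```

A quick check on the row count after subsetting (here 100 minus 76) is a cheap way to confirm that exactly the intended rows were removed.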
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

I don't know how to do this.
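
One possible starting point is a scatterplot matrix, as was done with pairs() for the College data. A sketch, assuming a simulated data frame auto_demo in which mpg falls and horsepower rises with weight; the relationships are built into the simulation and are not findings from the real Auto data:

```r
set.seed(1)
n = 100
weight = runif(n, 1600, 5100)

# Simulated columns: horsepower rises with weight, mpg falls with weight
auto_demo = data.frame(
  weight     = weight,
  horsepower = 20 + 0.03 * weight + rnorm(n, sd = 10),
  mpg        = 50 - 0.008 * weight + rnorm(n, sd = 2)
)

# One scatterplot for every pair of columns
pairs(auto_demo)
```

With the real data, the equivalent call would be pairs(Auto[, sapply(Auto, is.numeric)]), which leaves out the name column.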

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

I don't know.
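
A numerical complement to the plots is the correlation of each variable with mpg: variables with a large absolute correlation are candidates for predicting it. A sketch, assuming simulated data in a hypothetical data frame auto_demo, where weight is constructed to be related to mpg and acceleration is pure noise:

```r
set.seed(1)
n = 100
weight = runif(n, 1600, 5100)

# Simulated data: mpg depends on weight; acceleration is unrelated noise
auto_demo = data.frame(
  mpg          = 50 - 0.008 * weight + rnorm(n, sd = 2),
  weight       = weight,
  acceleration = rnorm(n, mean = 15.5, sd = 2.7)
)

# Correlation of every column with mpg, dropping mpg itself
cors = cor(auto_demo)[, "mpg"]
cors = cors[names(cors) != "mpg"]

# Strongest linear relationships first
cors[order(abs(cors), decreasing = TRUE)]
```

Correlation only captures linear association, so a curved relationship visible in a scatterplot can still be useful even when its correlation is modest.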

3. This exercise involves the Boston housing data set.
(a) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
    > library(MASS)

Now the data set is contained in the object Boston.

Boston

Read about the data set:
> ?Boston

How many rows are in this data set? How many columns? What do the rows and columns represent?
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
(e) How many of the suburbs in this data set bound the Charles river?
(f) What is the median pupil-teacher ratio among the towns in this data set?
(g) Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
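
Several of these parts reduce to one-line queries on the Boston data frame. A sketch of how they might be started, assuming the MASS package is installed (it ships with standard R installations); the result names n_river, med_ptratio and lowest_medv are hypothetical:

```r
library(MASS)   # provides the Boston data frame

dim(Boston)                                  # (a) number of rows and columns

n_river = sum(Boston$chas == 1)              # (e) chas is 1 if the tract bounds the river
med_ptratio = median(Boston$ptratio)         # (f) median pupil-teacher ratio

# (g) which.min() returns the first row that attains the minimum medv
lowest_medv = Boston[which.min(Boston$medv), ]

n_river
med_ptratio
lowest_medv
sum(Boston$rm > 7)                           # (h) more than seven rooms per dwelling
sum(Boston$rm > 8)                           # (h) more than eight rooms per dwelling
```

Comparing lowest_medv against sapply(Boston, range) is one way to approach the second half of part (g).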

