DAT 315 Machine Learning

Problem 1

Use the iris dataset to compute a 6-number summary (min, max, Q1, Q3, median, mean) for Sepal.Width for each type of iris (setosa, versicolor, virginica). Consider using the by command.

by(iris$Sepal.Width, iris$Species, summary)

## iris$Species: setosa
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.300   3.200   3.400   3.428   3.675   4.400 
## -------------------------------------------------------- 
## iris$Species: versicolor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.525   2.800   2.770   3.000   3.400 
## -------------------------------------------------------- 
## iris$Species: virginica
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.200   2.800   3.000   2.974   3.175   3.800

The code separates the data by species and gives the 6-number summary of the sepal width for each.
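As a minimal sketch of an alternative, the same grouped summary can be collected into a single matrix with one row per species. The variable name `summ` is our own choice and is not part of the assignment.

```r
# Split Sepal.Width by species, summarise each piece, and stack the results
# into a 3 x 6 matrix (rows = species, columns = summary statistics).
summ <- t(sapply(split(iris$Sepal.Width, iris$Species), summary))
summ
```

A matrix is convenient when the values are needed later, for example `summ["setosa", "Median"]`.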

Problem 2

In the mtcars dataset, the number of cylinders (cyl) is numeric. Create a new variable cyl2 that is a factor variable and use it to make a boxplot of mpg categorized by the number of cylinders labeling the axes as “Number of Cylinders” and “Miles per Gallon”.

num2 <- mtcars
num2$cyl2 <- as.factor(num2$cyl)
boxplot(mpg ~ cyl2, data = num2,
        xlab = "Number of Cylinders",
        ylab = "Miles per Gallon",
        pch = 19,
        col = rgb(0, 1, 0, 1/2))

After converting the number of cylinders (cyl) from a numeric variable to a factor variable, we can draw a boxplot that displays the 5-number summary of mpg for each number of cylinders.
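The numbers behind each box can be inspected directly. The sketch below uses `fivenum()`, which returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). The hinges can differ slightly from the quartiles reported by `summary()`. The names `cyl2` and `five` are our own.

```r
# Five-number summary of mpg for each cylinder count
cyl2 <- factor(mtcars$cyl)
five <- tapply(mtcars$mpg, cyl2, fivenum)
five
```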

Problem 3

Take a random sample of size n = 75 from a uniform distribution on [0,6] and compute the sample mean. Repeat this process 999 more times to produce a total of 1000 sample means. (Please do not display the data in your Markdown document!) Compute the mean and standard deviation of the 1000 sample means. (You might find the runif(), replicate(), and apply() functions to be helpful.) Are your results consistent with the central limit theorem? Explain.

data3 <- replicate(1000, mean(runif(75, min = 0, max = 6)))
hist(data3,
     main = "Histogram of Means",
     xlab="Means",
     ylab="Frequency",
     col = rgb(0,1,1,1/4))

mean(data3)

## [1] 3.008976

sd(data3)

## [1] 0.2003712

Are your results consistent with the central limit theorem?

The mean of the 1000 sample means is approximately 3 (3.009) and their standard deviation is approximately \(\frac{1}{5}\) (0.2004). The central limit theorem states that when random samples of sufficiently large size \(n\) are drawn from a population with mean \(\mu\) and standard deviation \(\sigma\), the sample means are approximately normally distributed with \(\mu_{\overline{x}}=\mu\) and \(\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}\). For a uniform distribution on \([0,6]\), \(\mu=\frac{0+6}{2}=3\) and \(\sigma=\sqrt{\frac{(6-0)^{2}}{12}}=\sqrt{3}\), so the theorem predicts \(\mu_{\overline{x}}=3\) and \(\sigma_{\overline{x}}=\frac{\sqrt{3}}{\sqrt{75}}=\frac{1}{5}\). Our results match both values closely, and the histogram shows that the sample means are approximately normally distributed, so the results are consistent with the central limit theorem.
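The comparison with the theoretical values can also be done in code. This is a minimal sketch: the seed is an arbitrary choice of ours, made only so the run is reproducible, so the simulated values will differ slightly from those printed above.

```r
set.seed(315)            # arbitrary seed (our assumption), for reproducibility
n <- 75; a <- 0; b <- 6  # sample size and endpoints of the uniform distribution

# 1000 sample means, each from a sample of size n
sample_means <- replicate(1000, mean(runif(n, min = a, max = b)))

# Values predicted by the central limit theorem
theory_mean <- (a + b) / 2
theory_sd   <- sqrt((b - a)^2 / 12) / sqrt(n)

c(observed_mean = mean(sample_means), theory_mean = theory_mean)
c(observed_sd = sd(sample_means), theory_sd = theory_sd)
```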

Problem 4

Use read.csv() to import the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/geese.txt into R. Make histograms of the logarithms of Aestimate and Bestimate and place them side-by-side in the same plot and label each using main.

geeseData <- read.csv("geese.csv")
par(mfrow=c(1,2))
# Aestimate Histogram
hist(log(geeseData$Aestimate), col = rgb(0, 0, 1, 1/4),
     main = "Histogram of A estimate",
     xlab = "Log Aestimate",
     ylab = "Frequency")
# Bestimate Histogram
hist(log(geeseData$Bestimate), col = rgb(1, 0, 0, 1/4),
     main = "Histogram of B estimate",
     xlab = "Log Bestimate",
     ylab = "Frequency")

Problem 5

Use read.csv() with header=FALSE to import the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA251/Data/CerealSugar1979.txt and T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA251/Data/CerealSugar2006.txt. Use names() to rename the columns Cereal and SugarContent, respectively. Then do a two-sample t-test to determine if the mean sugar content of cereals has changed from 1979 to 2006.

sugarData1979 <- read.csv("CerealSugar1979.txt", header = FALSE, sep = ")")
sugarData2006 <- read.csv("CerealSugar2006.txt", header = FALSE, sep = ")")
names(sugarData1979)<-c("Cereal", "SugarContent")
names(sugarData2006)<-c("Cereal", "SugarContent")
t.test(sugarData1979$SugarContent,
       sugarData2006$SugarContent)

## 
##  Welch Two Sample t-test
## 
## data:  sugarData1979$SugarContent and sugarData2006$SugarContent
## t = -0.42044, df = 108.99, p-value = 0.675
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.861317  4.459737
## sample estimates:
## mean of x mean of y 
##  26.76452  27.96531

The Welch two-sample t-test gives a p-value of 0.675. Since 0.675 is greater than the standard significance level \(\alpha=.05\), we do not have enough evidence to conclude that the mean sugar content of cereals changed from 1979 to 2006. The 95 percent confidence interval for the difference in means, (-6.86, 4.46), contains 0, which agrees with this conclusion.
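As a sanity check, the two-sided p-value can be recovered from the reported test statistic and degrees of freedom alone. This sketch only reuses the numbers printed in the output above. The variable names are our own.

```r
t_stat <- -0.42044   # t statistic reported by t.test()
df     <- 108.99     # Welch degrees of freedom reported by t.test()
alpha  <- 0.05       # significance level used in the write-up

# Two-sided p-value: probability in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value

# TRUE would mean "reject the null hypothesis of equal means"
p_value < alpha
```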

Problem 6

Use read.table() to import the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA252/Data/PC-Text/Chapter 11/retail.txt. Build a linear model to predict Gross-Sales using Gross-Cash, Cash-Items, and Gross-Check as predictors. Interpret the R output to type the resulting equation for the model.

retail <- read.table("retail.txt", header = TRUE)
model6 <- lm(Gross.Sales ~ Gross.Cash + Cash.Items + Gross.Check, data = retail)
summary(model6)

## 
## Call:
## lm(formula = Gross.Sales ~ Gross.Cash + Cash.Items + Gross.Check, 
##     data = retail)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85.96 -29.67  -5.64  29.02  93.42 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.92979   22.40870  -0.577   0.5701    
## Gross.Cash    0.08669    0.46887   0.185   0.8551    
## Cash.Items    7.62916    2.86987   2.658   0.0147 *  
## Gross.Check   1.16480    0.13066   8.915 1.39e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.68 on 21 degrees of freedom
## Multiple R-squared:  0.9307, Adjusted R-squared:  0.9208 
## F-statistic: 94.02 on 3 and 21 DF,  p-value: 2.462e-12

#plot(retail$Gross.Sales, model6$fitted.values)
#abline(0,1, col="blue")

Using R to build a linear model, we found that \[GrossSales = -12.92979 + 0.08669*GrossCash + 7.62916*CashItems + 1.16480*GrossCheck\] Holding the other predictors fixed, if GrossCash increases by 1 unit, predicted GrossSales increases by 0.08669; if CashItems increases by 1 unit, predicted GrossSales increases by 7.62916; and if GrossCheck increases by 1 unit, predicted GrossSales increases by 1.16480. Finally, if all predictors are zero, the model predicts GrossSales of -12.92979. Neither the intercept (p = 0.5701) nor the GrossCash coefficient (p = 0.8551) is significantly different from zero at the .05 level.
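To make the interpretation concrete, the fitted equation can be written as a small function. The coefficients are copied from the summary output. The function name `predict_sales` and the input values are hypothetical, chosen only for illustration.

```r
# Coefficients copied from summary(model6)
coefs <- c(intercept = -12.92979, Gross.Cash = 0.08669,
           Cash.Items = 7.62916, Gross.Check = 1.16480)

# Hypothetical helper: predicted Gross.Sales for given predictor values
predict_sales <- function(gross_cash, cash_items, gross_check) {
  unname(coefs["intercept"] +
           coefs["Gross.Cash"]  * gross_cash +
           coefs["Cash.Items"]  * cash_items +
           coefs["Gross.Check"] * gross_check)
}

# All predictors zero gives the intercept
predict_sales(0, 0, 0)

# One extra cash item, other predictors fixed (illustrative inputs):
# the difference equals the Cash.Items coefficient
predict_sales(100, 10, 500) - predict_sales(100, 9, 500)
```

With the fitted model object available, `predict(model6, newdata = ...)` does the same job without copying coefficients by hand.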

Problem 7

Kudzu is a plant that was imported to the United States from Japan and now covers over seven million acres in the South. The plant contains chemicals called isoflavones that have been shown to have beneficial effects on bones. One study at Purdue University used three groups of rats to compare a control group with rats that were fed either a low dose or a high dose of isoflavones from kudzu. One of the outcomes examined was the bone mineral density in the femur (in grams per square centimeter). Use the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA252/Data/PC-Text/Chapter 14/kudzu.txt to do a one-way ANOVA and interpret the result. Recall that we should not perform a one-way ANOVA if any of the group standard deviations is more than twice any other. Use by() and sd() to check that this condition is satisfied.

tbl = read.table("kudzu.txt", header = TRUE)
by(tbl$BMD, tbl$Treatment, sd)

## tbl$Treatment: Control
## [1] 0.01158735
## -------------------------------------------------------- 
## tbl$Treatment: HighDose
## [1] 0.01877105
## -------------------------------------------------------- 
## tbl$Treatment: LowDose
## [1] 0.01151066

# Analysis of Variance
res.aov <- aov(BMD~Treatment, data = tbl)
# Analysis Summary
summary(res.aov)

##             Df   Sum Sq   Mean Sq F value Pr(>F)   
## Treatment    2 0.003186 0.0015928   7.718 0.0014 **
## Residuals   42 0.008668 0.0002064                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given standard deviations of \(.01158735\), \(.01877105\), and \(.01151066\) for the control, high dose, and low dose treatments respectively, no group standard deviation is more than twice any other (the largest ratio is \(.01877105/.01151066 \approx 1.63\)), so we may use a one-way ANOVA. Using a significance level of \(.01\), our p-value of \(.0014\) is less than \(.01\), so we have enough evidence to conclude that the mean bone mineral density of at least one treatment group differs from the others.
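Both checks can be reproduced from the printed output. The sketch below hard-codes the group standard deviations and the mean squares from the ANOVA table, so small rounding differences from R's own F value are expected. The variable names are our own.

```r
# Group standard deviations copied from the by() output
sds <- c(Control = 0.01158735, HighDose = 0.01877105, LowDose = 0.01151066)

# Rule of thumb: largest SD should be no more than twice the smallest
ratio <- max(sds) / min(sds)
ratio
ratio < 2

# F statistic = Mean Sq (Treatment) / Mean Sq (Residuals), from the ANOVA table
f_stat <- 0.0015928 / 0.0002064
# Upper-tail probability of the F distribution with 2 and 42 degrees of freedom
p_val <- pf(f_stat, df1 = 2, df2 = 42, lower.tail = FALSE)
c(F = f_stat, p = p_val)
```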

Problem 8

Download the .csv file from https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv and use R to determine the number of observations for which VAL = 24.

paste("Data8 csv was downloaded ", date(), sep = "")

## [1] "Data8 csv was downloaded Sun Feb  3 18:41:48 2019"

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv", destfile = "./data8.csv", method="curl")
data8 = read.csv("data8.csv")
table(data8$VAL)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##  75  42  33  30  26  29  23  70  99 119 152 199 233 495 483 486 357 502 
##  19  20  21  22  23  24 
## 232 312 164 159  47  53

Data8 csv was downloaded Sat Feb 2 22:03:57 2019. The number of observations with VAL = 24 is 53.
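The count can also be obtained directly, without reading it off the table. The sketch below uses a small made-up vector in place of the VAL column so that it runs on its own. The vector `val` is hypothetical toy data, not the survey data.

```r
# Hypothetical toy stand-in for data8$VAL, including missing values
val <- c(24, 3, NA, 24, 18, NA, 24)

# Comparison yields TRUE/FALSE/NA; na.rm = TRUE drops the NAs before summing
n_24 <- sum(val == 24, na.rm = TRUE)
n_24
```

On the real data the equivalent call would be `sum(data8$VAL == 24, na.rm = TRUE)`.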

Problem 9

Use the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA463/MONDIAL.accdb to produce a histogram for the logarithm of the population density of every country (log(number of people/unit area)).

country <- read.csv("Country.txt", header=FALSE)
model9 <- hist(log(country$V6/country$V5),
               col=rgb(0,0,1,1/4),
               main="Histogram of Log Population Density",
               xlab="Log Population Density",
               ylab="Frequency"
      )

To use the data in the Access file, we had to export the Country table to a text file. To export it, select the Country table in Access, click the External Data tab, and then select Text File. In the window that pops up, click “OK”, leaving the “Export Data with Formatting and layout” checkbox unchecked. In the next window, click Finish and then close the window. Save the text file to the folder where your R project is located and name it “Country”.

Problem 10

Use map() from the maps package to make a map of the world. Then use points() to add the cities in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA463/MONDIAL.accdb to the map. Your graph should look something like the following.

City <- read.csv("City.csv", header=TRUE)
library(maps)
map("world", fill=TRUE, col="white", bg=rgb(0,0,1,1/4))
points(City$Longitude,City$Latitude,
       col=rgb(1,0,1,2/4),
       pch=20
      )

To use the data in the Access file, we must export the City table to a CSV file. To do this, first select the City table in Access, click the External Data tab, and then select Excel. A pop-up window will appear; click OK. Locate the newly saved Excel file, open it, and save it as “City” in “CSV (comma delimited)” format in the folder where your R project is located.

DAT 315 Machine Learning - Project 1

Mackenzie Garner

Anthony Knight

Tuyen Le

02/02/2019