The following project is a work for the subject Machine Learning, in which we will see 10 different problems and how to solve them using RStudio as our primary software. I will give plenty of explanations about each exercise and its solution.
Statement: Compute a 6-number summary of “Sepal.Width” for each type of iris (Setosa, Versicolor and Virginica), using the data “iris”.
Using the command “summary()”, the following instructions give the 6-number summary for each species.
summary(iris[iris$Species=="setosa",]$Sepal.Width)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 3.200 3.400 3.428 3.675 4.400
summary(iris[iris$Species=="versicolor",]$Sepal.Width)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.525 2.800 2.770 3.000 3.400
summary(iris[iris$Species=="virginica",]$Sepal.Width)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.200 2.800 3.000 2.974 3.175 3.800
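As a side note, the same three summaries can be produced in a single call by grouping on Species; a minimal sketch using base R’s “by()” (no extra packages assumed):

# Apply summary() to Sepal.Width within each level of Species
by(iris$Sepal.Width, iris$Species, summary)

This prints one 6-number summary per species, matching the three separate calls above.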
Statement: Create a boxplot of mpg using the “mtcars” data, categorized by the number of cylinders. Also turn the numeric variable cyl into a factor variable.
The command “factor()” carries out the change of variable, and the command “class()” confirms it. The command “boxplot()” is then used to make a boxplot based on what the statement requires.
cyl2 <- factor(mtcars$cyl)
class(cyl2)
## [1] "factor"
boxplot(mpg~cyl,data=mtcars, main="Number of Cylinders", xlab="Cylinders", ylab="miles per gallon")
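Note that the boxplot above groups by the original numeric cyl column (the formula interface turns it into groups automatically). A minimal sketch that uses the factor variable cyl2 created above instead, producing the same plot:

# Group mpg by the factor version of cyl rather than the numeric column
boxplot(mtcars$mpg ~ cyl2,
        main = "Number of Cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon")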
Statement: For a random sample of size n = 100 from a normal distribution with mean mu = 10 and standard deviation sd = 5, compute the sample mean 1000 times. Get the mean and the standard deviation of the 1000 sample means. In addition, are the results consistent with the central limit theorem?
Using the command “replicate()”, we are able to compute 1000 sample means; the command “rnorm()” draws the random sample of size 100 from a normal distribution with mean 10 and standard deviation 5, while the command “mean()” computes the mean of that random sample. With the commands “sd()” and “mean()” we get the standard deviation and mean of the 1000 sample means. To check that the results are consistent with the central limit theorem, the commands “hist()” and “plot()” are used to represent the results.
means <- replicate(1000, mean(rnorm(100,10,5)))
mean1 <- mean(means)
sd1 <- sd(means)
hist(means)
mean1
## [1] 9.979682
sd1
## [1] 0.4590636
plot(mean1, sd1, main="1000 Samples", xlab="Mean of the sample means", ylab="Standard deviation of the sample means")
Yes, they are. The mean of the 1000 sample means (9.98) is approximately equal to the population mean of 10, and their standard deviation (0.46) is close to the theoretical standard error 5/sqrt(100) = 0.5, with the histogram showing a roughly normal shape. We can see this in both the histogram and the plot.
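To make the comparison with the central limit theorem explicit, a minimal sketch that contrasts the observed values with the theoretical ones and overlays the corresponding normal density on the histogram of sample means:

# CLT check: sample means should be centred near mu = 10
# with spread close to sigma / sqrt(n) = 5 / sqrt(100) = 0.5
mean1 - 10           # difference from the population mean
sd1 - 5 / sqrt(100)  # difference from the theoretical standard error
# Overlay the theoretical normal density on the histogram of sample means
hist(means, freq = FALSE, main = "1000 Sample Means", xlab = "Sample mean")
curve(dnorm(x, mean = 10, sd = 0.5), add = TRUE, col = "red")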
Statement: This problem requires importing the data from a .csv file and making two histograms of the logarithms of the data, placed side by side in the same plot.
To import the data from a .csv file, we use the command “read.csv”, assigning the data to a variable name (in this case the name is photos). To place the histograms side by side in the same plot, we use the command “par()”, and then the command “hist()”, combined with “log()”, lets us create the two histograms.
photos = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/geese.txt")
par(mfrow = c(1,2))
hist(log(photos$Aestimate),main = paste("Histogram of", "Aestimate"), xlab = "Aestimate Data")
hist(log(photos$Bestimate),main = paste("Histogram of", "Bestimate"), xlab = "Bestimate Data")
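One practical detail worth noting: par(mfrow = c(1,2)) stays in effect for any later plots, so it is common to save and restore the previous settings. A minimal sketch, using the same data and column names as above:

# Save the old graphical parameters, draw the two panels, then restore them
old_par <- par(mfrow = c(1, 2))
hist(log(photos$Aestimate), main = "Histogram of Aestimate", xlab = "log(Aestimate)")
hist(log(photos$Bestimate), main = "Histogram of Bestimate", xlab = "log(Bestimate)")
par(old_par)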
Statement: Import the data from two .csv files (in this case “header = FALSE” has to be used), then rename the files’ columns and finally do a two-sample t-test to determine whether the mean has changed between the two files.
We use the command “read.csv”, but in this case with “header = FALSE”, which leaves the files without specific column titles. To assign or change the names of these columns, the command “names()” is used. For the two-sample t-test we use the command “t.test()”; from its output we can see the difference between the means as well as the mean of each file. The command “mean()” also gives the mean of each file, but it does not provide the other values from the two-sample t-test.
cereal79 = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/CerealSugar1979.txt", header = FALSE)
cereal06 = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/CerealSugar2006.txt", header = FALSE)
names(cereal79)[2] <- "SugarContent"
names(cereal06)[2] <- "SugarContent"
names(cereal06)[1] <- "Cereal"
names(cereal79)[1] <- "Cereal"
t.test(cereal79$SugarContent,cereal06$SugarContent, alternative = "less", var.equal = TRUE)
##
## Two Sample t-test
##
## data: cereal79$SugarContent and cereal06$SugarContent
## t = -0.40946, df = 109, p-value = 0.3415
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 3.664329
## sample estimates:
## mean of x mean of y
## 26.76452 27.96531
mean(cereal06$SugarContent)
## [1] 27.96531
mean(cereal79$SugarContent)
## [1] 26.76452
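Since var.equal = TRUE assumes the two groups share a common variance, one optional robustness check is to rerun the test with R’s default Welch correction and compare the conclusions; a minimal sketch using the same objects as above:

# Welch's t-test (var.equal = FALSE by default) does not assume equal variances
t.test(cereal79$SugarContent, cereal06$SugarContent, alternative = "less")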
Statement: Import the data from a .txt file, then build a linear model to predict a value using three predictors. At the end, type the final equation for the model.
To import the data from a .txt file, we use the command “read.table()”, assigning the data to “retail” (the variable name). With the command “lm()” we get the linear model that predicts a value from the three predictors, as well as the coefficients needed to write the equation of the model.
retail = read.table("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/retail.txt", header = TRUE)
lm(formula = retail$Gross.Sales ~ retail$Gross.Cash + retail$Gross.Check + retail$Cash.Items, data = retail)
##
## Call:
## lm(formula = retail$Gross.Sales ~ retail$Gross.Cash + retail$Gross.Check +
## retail$Cash.Items, data = retail)
##
## Coefficients:
## (Intercept) retail$Gross.Cash retail$Gross.Check
## -12.92979 0.08669 1.16480
## retail$Cash.Items
## 7.62916
Gross.Sales = -12.92979 + 0.08669(Gross.Cash) + 1.16480(Gross.Check) + 7.62916(Cash.Items)
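As a side note, the model is usually fit with bare column names and the data argument, which also makes it straightforward to pass new predictor values to predict(). A minimal sketch; the new values below are purely hypothetical, chosen only for illustration:

# Fit with column names plus data = retail (avoids the retail$ prefixes)
fit <- lm(Gross.Sales ~ Gross.Cash + Gross.Check + Cash.Items, data = retail)
summary(fit)  # coefficients with standard errors and R-squared
# Hypothetical new day of sales data (illustrative values only)
newday <- data.frame(Gross.Cash = 500, Gross.Check = 200, Cash.Items = 40)
predict(fit, newdata = newday)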
Statement: Import the data from a .txt file, then do a one-way ANOVA and interpret the results. Confirm that no group standard deviation is more than twice any other; this condition has to be satisfied.
To import the data, we use the command “read.table()”. Then we create a boxplot and assign the linear model to a new variable, using the commands “boxplot()” and “lm()”; the command “summary()” gives additional detail. To do a one-way ANOVA we use the command “anova()”, and the command “confint()” is used to calculate confidence intervals on the parameters. In order to confirm the condition, we check the standard deviation of each group, step by step.
kudzu = read.table("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/kudzu.txt", header = TRUE)
boxplot(BMD ~ Treatment, data = kudzu, main = "Kudzu Experiment", xlab = "Treatment", ylab = "BMD")
kudzu1 = lm(BMD ~ Treatment, data = kudzu)
summary(kudzu1)
##
## Call:
## lm(formula = BMD ~ Treatment, data = kudzu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.029067 -0.009867 0.000067 0.009933 0.031933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.218867 0.003709 59.007 < 2e-16 ***
## TreatmentHighDose 0.016200 0.005246 3.088 0.00356 **
## TreatmentLowDose -0.002933 0.005246 -0.559 0.57900
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01437 on 42 degrees of freedom
## Multiple R-squared: 0.2688, Adjusted R-squared: 0.2339
## F-statistic: 7.718 on 2 and 42 DF, p-value: 0.001397
anova(kudzu1)
## Analysis of Variance Table
##
## Response: BMD
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 2 0.0031856 0.00159282 7.7182 0.001397 **
## Residuals 42 0.0086676 0.00020637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(kudzu1)
## 2.5 % 97.5 %
## (Intercept) 0.211381217 0.226352116
## TreatmentHighDose 0.005613975 0.026786025
## TreatmentLowDose -0.013519358 0.007652691
sd(kudzu[1:15,2])
## [1] 0.01158735
sd(kudzu[16:30,2])
## [1] 0.01151066
sd(kudzu[31:45,2])
## [1] 0.01877105
by(kudzu$BMD, kudzu$Treatment, sd)
## kudzu$Treatment: Control
## [1] 0.01158735
## --------------------------------------------------------
## kudzu$Treatment: HighDose
## [1] 0.01877105
## --------------------------------------------------------
## kudzu$Treatment: LowDose
## [1] 0.01151066
The value of Pr(>F) given by the one-way ANOVA is 0.001397; since it is well below 0.05, there is evidence that the mean BMD differs between the treatments. The condition is satisfied because none of the group standard deviations is more than twice any other.
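The twice rule can also be checked in one step by collecting the three standard deviations into a named vector; a minimal sketch:

# Programmatic version of the twice rule: the largest group standard
# deviation should be less than twice the smallest
sds <- tapply(kudzu$BMD, kudzu$Treatment, sd)
max(sds) / min(sds)      # ratio of largest to smallest
max(sds) / min(sds) < 2  # TRUE when the condition is satisfied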
Statement: It is required to determine the number of observations whose value is equal to “24” in a .csv file.
To import the data from a .csv file, we use the command “read.csv”, assigning the data to “eight” (the variable name). To determine the number, we use the commands “length()” and “which()” in just one line: the second one tells us which entries have a value equal to 24, and the first one tells us how many there are.
eight = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/getdata%2Fdata%2Fss06hid.csv")
length(which(eight[,1:188] == 24))
## [1] 6226
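An equivalent way to obtain the same count is to sum the logical comparison directly, which also makes the handling of missing values explicit; a minimal sketch:

# Count entries equal to 24 by summing the logical matrix, ignoring NAs
sum(eight[, 1:188] == 24, na.rm = TRUE)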
Statement: Import the data from a .accdb file and produce a histogram of the logarithm of the result obtained from two columns.
To import the data from a .accdb file, we first export the table to a .txt file; then with the command “read.csv()” we read the data. The command “names()” helps change the names of the columns. To make the histogram of the logarithm computed from the two columns (population divided by unit area), we use the commands “log()” and “hist()” as follows.
country = read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/country.txt", header = FALSE)
names(country)[6] <- "Population"
names(country)[5] <- "UnitArea"
hist(log(country$Population/country$UnitArea), main = paste("Histogram of population density"), xlab = "Population density", ylab = "Number of Countries")
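A small optional variation is to store the population density as its own column first, which keeps the transformation reusable and easy to summarize; a minimal sketch using the same column names as above:

# Keep the density as a column so it can be summarized and reused
country$Density <- country$Population / country$UnitArea
summary(country$Density)
hist(log(country$Density), main = "Histogram of population density",
     xlab = "log(Population density)", ylab = "Number of Countries")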
Statement: Import the data from a .accdb file. Use the data to plot points over a map of the world.
To import the data from the .accdb file, we choose to export the table to a .txt file and read it with “read.csv()”. Then, with the command “library()”, we load the maps library to draw the map of the world. To plot the points, the command “points()” is used.
library(maps)
cities <- read.csv("C:/Users/cat_b/Desktop/Elizabethtown College/Machine Learning/Data/City.txt")
map("world", fill=TRUE, col="white", bg="lightblue")
points(cities$Longitude,cities$Latitude, col="red", pch = 16)
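As a possible refinement, each point could be labelled with its city name using “text()”. This is only a sketch: the column name “Name” is an assumption and would have to be adjusted to whatever column City.txt actually contains.

# Redraw the map and label the points; cities$Name is a hypothetical column
map("world", fill = TRUE, col = "white", bg = "lightblue")
points(cities$Longitude, cities$Latitude, col = "red", pch = 16)
text(cities$Longitude, cities$Latitude, labels = cities$Name,
     pos = 3, cex = 0.6)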