#1(a)
#The data for this homework is on the class web site. The data set HW01pb2data.zip is a zipped version of HW01pb2data.csv
#1) This question uses the data HW01pb1data.csv. Download it to your computer. a) Read in the data in R using data<-read.csv("HW01pb1data.csv",header=FALSE). Note, you first need to specify your working directory using the setwd() command. Determine whether each of the attributes (columns) is treated as qualitative (categorical) or quantitative (numeric) using R. Explain how you can tell using R.
setwd("C:/Users/Manjari/Desktop/Machine learning/Home Work Solutions")
myData<-read.csv("HW01pb1data.csv",header=FALSE)
attributes_type<- c()
for(i in 1:5){
attributes_type[i]<-class(myData[,i])
}
attributes_type
## [1] "integer" "integer" "integer" "factor" "factor"
#The first, second, and third attributes are treated as quantitative because their class is integer. The fourth and fifth attributes are treated as qualitative because their class is factor.
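#The same check can be written without an explicit loop. This is a minimal sketch: the data frame toy is a hypothetical stand-in for myData (it is not the homework file), built only so the example runs on its own.

```r
# Hypothetical stand-in for myData: three integer columns and two factor columns.
toy <- data.frame(
  V1 = 1:3,
  V2 = c(10L, 20L, 30L),
  V3 = c(5L, 6L, 7L),
  V4 = factor(c("10", "20", "thirty five")),
  V5 = factor(c("5", "twenty five", "15"))
)

# sapply() applies class() to every column and returns a named character vector.
column_types <- sapply(toy, class)
print(column_types)
```

#On the real data, sapply(myData, class) should report the same information as the loop above.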
#b) What is the specific problem that causes two of these attributes to be read in as qualitative (categorical) when it seems it should be quantitative (numeric)?
#1(b)
for (i in 1:5) {
  #Only factor columns have levels, so print the levels for those columns only
  if (!is.null(levels(myData[, i]))) {
    cat(sprintf("Column %d:\n", i))
    print(levels(myData[, i]))
  }
}
## Column 4:
## [1] "0" "10" "100" "110" "120"
## [6] "140" "15" "150" "160" "20"
## [11] "200" "25" "30" "35" "40"
## [16] "5" "50" "55" "60" "65"
## [21] "70" "80" "85" "90" "thirty five"
## Column 5:
## [1] "0" "10" "120" "140" "15"
## [6] "20" "25" "255" "30" "35"
## [11] "40" "45" "5" "50" "55"
## [16] "60" "70" "80" "twenty five"
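#The problem is that the 4th and 5th columns each contain a value spelled out in words ("thirty five" in column 4 and "twenty five" in column 5). Because that entry cannot be parsed as a number, R reads the whole column as a factor instead of as numeric.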
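#One way to locate such entries is to convert the column to numeric and look for the values that fail to parse. This is only a sketch: the vector raw is hypothetical and stands in for a column like myData[,4].

```r
# Hypothetical column mimicking the problem: one value is spelled out in words.
raw <- factor(c("10", "35", "thirty five", "20"))

# as.numeric() on a factor returns the internal level codes, not the values,
# so convert to character first. The NA-coercion warning is expected here.
values <- suppressWarnings(as.numeric(as.character(raw)))

# Entries that could not be parsed become NA; which() gives their positions.
bad_rows <- which(is.na(values))
print(bad_rows)
print(as.character(raw)[bad_rows])
```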
#c) Use the command plot() in R to make a plot for column 1 by entering plot(data[,1]). Use a similar command to plot column 4 (that is plot(data[,4])). Because one variable is read in as quantitative (numeric) and the other as qualitative (categorical), these two plots are showing completely different things by default. Explain exactly what is being plotted in each case. Include these plots in your homework.
##1(c)
plot(myData[,1])
#Because column 1 is numeric, plot() draws a scatter plot: the x-axis is the row index (1, 2, 3, ...) and the y-axis is the value stored in that row.
plot(myData[,4])
#Because column 4 is a factor, plot() draws a bar chart: there is one bar per factor level, and its height is the number of rows that have that level.
#d) (optional) Read the data into Excel. Excel should have no problem opening the file directly since it is .csv. Create a new column that is equal to the fourth column plus 10. What is the result for the problem observations (rows) you identified in part b? What specific outcome does Excel display?
#1(d)
#When a cell holds a number, the new cell in the 6th column gets the new value. When a cell holds a string such as "thirty five", the new cell in the 6th column gets no value and an error like the following is returned: "The operation "+"expected a number,date or duration, but cell D405 contain a string."
#2) This question uses the data in the file HW01pb2data.csv. Download it to your computer. a) Read the data into R using data<-read.csv("HW01pb2data.csv",header=FALSE). Note, you first need to specify your working directory using the setwd() command. Extract a simple random sample with replacement of 10,000 observations (rows). (Hint: R has a function called sample) Show your R commands for doing this.
#2(a)
getwd()
## [1] "C:/Users/Manjari/Desktop/Machine learning/Home Work Solutions"
setwd("C:/Users/Manjari/Desktop/Machine learning/Home Work Solutions")
data <- read.csv("HW01pb2data.csv",header=FALSE)
sample_data<-sample(data[,1],10000,replace=TRUE)
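#If the sample needs to be repeatable, or if whole rows are wanted instead of one column, the row indices can be sampled. This is a sketch under assumptions: population is simulated data (not HW01pb2data.csv) and the seed value is arbitrary.

```r
set.seed(1)  # assumed seed, used only so the draw can be repeated

# Simulated stand-in for the homework data frame (one numeric column V1).
population <- data.frame(V1 = rnorm(50000, mean = 9.45, sd = 2))

# Draw 10,000 row indices with replacement, then pull those rows.
row_ids <- sample(nrow(population), 10000, replace = TRUE)
sample_rows <- population[row_ids, , drop = FALSE]

print(dim(sample_rows))
```

#drop = FALSE keeps the result a data frame even though there is only one column.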
#b) For your sample, use the functions mean(), max(), var() and quantile(,.25) to compute the mean, maximum, variance and 1st quartile respectively. Show your R code and the resulting values.
#2(b)
mean(sample_data)
## [1] 9.460937
max(sample_data)
## [1] 16.5152
var(sample_data)
## [1] 4.085747
quantile(sample_data,.25)
## 25%
## 8.10149
#c) Compute the same quantities in part b on the entire data set and show your answers. How much do they differ from your answers in part b?
#2(c)
mean(data$V1)
## [1] 9.451468
max(data$V1)
## [1] 18.96657
var(data$V1)
## [1] 4.001822
quantile(data$V1,0.25)
## 25%
## 8.10388
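#To answer how much the values differ, the numbers reported in parts b) and c) can be subtracted directly. The vectors below simply copy the output shown above; the names sample_stats and full_stats are introduced here for illustration.

```r
# Values copied from the output reported in parts b) and c).
sample_stats <- c(mean = 9.460937, max = 16.5152, var = 4.085747, q25 = 8.10149)
full_stats   <- c(mean = 9.451468, max = 18.96657, var = 4.001822, q25 = 8.10388)

# Sample statistic minus full-data statistic.
differences <- sample_stats - full_stats
print(round(differences, 4))
```

#The mean, variance and 1st quartile of the sample are close to those of the full data set, while the maximum differs by about 2.45. The maximum depends on a single extreme row, so a sample can only match it if that row happens to be drawn.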
#d) (Optional Part) Save your sample from R to a csv file using the command write.csv(). Then open this file with Excel and compute the mean, maximum, variance and 1st quartile. Provide the values and name the Excel functions you used to compute these.
#2(d)
write.csv(sample_data, file = "sample_data.csv")
#2(e) (Optional Part) Exactly what happens if you try to open the full data set with Excel?
#When I tried to open the full data set, Excel showed only 65,536 of the 2,000,000 rows. This depends on the Excel version: older versions are limited to 65,536 rows per worksheet.
setwd("C:/Users/Manjari/Desktop/Machine learning/Home Work Solutions")
ocean <- read.csv("HW01pb3OceanViewdata.csv", header = FALSE)
desert <- read.csv("HW01pb3Desertdata.csv", header = FALSE)
#3(a) This question uses a sample of 2000 Ocean View house prices in the file HW01pb3OceanViewdata.csv and a sample of 5000 Desert house prices in the file HW01pb3Desertdata.csv. Download both data sets to your computer. Note that the house prices are in thousands of dollars. (Hint: look at the file MyFirstRLesson.r) Use R to produce a single graph displaying a box plot for each set. Include the R commands and the plot. Put a name in the title of the plot (for example, main="House Box Plots"). Explain the box plot.
class(ocean)
## [1] "data.frame"
boxplot(ocean, at = 1, xlim = c(0.5, 2.5), ylim = range(c(ocean, desert)), main = "House Box Plots")
boxplot(desert, at = 2, add = TRUE)
#The median price of the ocean view houses is much higher than the median price of the desert houses. The ocean view prices are almost symmetrically distributed, whereas the desert prices are right-skewed. There are more outliers in the desert data set than in the ocean view data set.
#3(b) Use R to produce a frequency histogram for only the Ocean View house prices. Use intervals of width $500,000 beginning at 0 and ending at $3 million. Include the R commands and the plot. Create an appropriate title for the plot. (Hint: Use the hist R command)
names(ocean)[1] <- "HousePrice"
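#Prices are in thousands of dollars, so intervals of width $500,000 from 0 to $3 million are 0, 500, ..., 3000
breaks <- seq(0, 3000, by = 500)
hist(ocean$HousePrice, breaks, main = "Ocean View House Distribution by Price",xlab = "House Price")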
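#The bin counts behind a frequency histogram can be inspected without drawing it. This is a sketch with a hypothetical prices vector (in thousands of dollars) standing in for ocean$HousePrice.

```r
# Hypothetical prices in thousands of dollars.
prices <- c(450, 787, 1029, 1052, 1480, 1999, 2133, 2401, 2600)

# Equal-width bins of 500 from 0 to 3000.
breaks <- seq(0, 3000, by = 500)

# plot = FALSE returns the histogram object instead of drawing it.
h <- hist(prices, breaks = breaks, plot = FALSE)
print(h$counts)
```

#Equal-width breaks matter here: when the breaks are not equally spaced, hist() plots densities instead of frequencies by default.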
#3(c) The empirical cumulative distribution function is described in the web site: http://en.wikipedia.org/wiki/ECDF Use R to plot the ECDF of the Ocean View houses and Desert houses on the same graph. Include a legend. Include the R commands and the plot. Create a title for the plot.
ecdf(ocean$HousePrice)
## Empirical CDF
## Call: ecdf(ocean$HousePrice)
## x[1:567] = 787, 1029, 1052, ..., 2133, 2401
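#Plot both ECDFs on the same graph; xlim covers the price range of both data sets
plot(ecdf(ocean$HousePrice), col = "blue", xlim = range(c(ocean$HousePrice, desert[, 1])), main = "Empirical Cumulative Distribution Function of Ocean View and Desert Houses")
plot(ecdf(desert[, 1]), col = "red", add = TRUE)
legend("bottomright", legend = c("Ocean View", "Desert"), col = c("blue", "red"), lty = 1)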
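#The object returned by ecdf() is itself a function, which makes the meaning of the plot easy to check: its value at q is the fraction of observations less than or equal to q. A small sketch with a hypothetical prices vector:

```r
# Hypothetical prices in thousands of dollars.
prices <- c(787, 1029, 1052, 2133, 2401)

# ecdf() returns a step function Fn with Fn(q) = proportion of values <= q.
Fn <- ecdf(prices)

print(Fn(500))   # no value is <= 500
print(Fn(1052))  # 3 of the 5 values are <= 1052
print(Fn(2401))  # all values are <= the maximum
```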
#4(a) This question uses the Orange data set which is included in the R download. Type in the R command: orange <- as.data.frame(Orange). The data frame, orange, consists of three columns: Tree, age, and circumference. a) Use plot() in R to make a scatter plot for this data with age on the x-axis and circumference on the y-axis. What range should be given for the x-axis? What about the y-axis range? Create an appropriate title for the plot. Include the R commands and the plot.
orange <- as.data.frame(Orange)
plot(orange$age,orange$circumference,main = "Scatter: Age by Circumference",xlim=c(min(orange$age),max(orange$age)),ylim=c(min(orange$circumference),max(orange$circumference)))
#4(b) Compute the correlation between the age and circumference of the first tree in R using the function cor().
cor(orange$age[orange$Tree == 1], orange$circumference[orange$Tree == 1])
## [1] 0.9854675
#4(c) For this problem you may want to use some of the following R functions: names, merge, cov, and cor. Create a covariance - correlation chart which has the covariance and correlation of the age and circumference for each tree. Have your code print out the following chart with the same titles and the values filled in.
#TREE COVARIANCE CORRELATION
#1 1
#2 2
#3 3
#4 4
#5 5
#Other functions that may be of interest are ddply, tapply, and subset. The ddply() function is in the plyr package. Check out http://had.co.nz/plyr
library(plyr)
ddply(orange, .(Tree), summarize, cor = cor(age, circumference), cov = cov(age, circumference))
## Tree cor cov
## 1 3 0.9881766 22239.83
## 2 1 0.9854675 22340.07
## 3 5 0.9877376 30442.81
## 4 2 0.9873624 34290.45
## 5 4 0.9844610 37062.62
covarience<-data.frame(Tree=c(1:max(orange$Tree)),covarience=c(1))
correlation<-data.frame(Tree=c(1:max(orange$Tree)),correlation=c(1))
for(i in 1:nrow(correlation)){
covarience[i,2]<-cov(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i])
correlation[i,2]<-cor(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i])
}
covacorr<-merge(covarience,correlation,"Tree")
print(covacorr)
## Tree covarience correlation
## 1 1 22340.07 0.9854675
## 2 2 34290.45 0.9873624
## 3 3 22239.83 0.9881766
## 4 4 37062.62 0.9844610
## 5 5 30442.81 0.9877376
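#The same chart can also be built in base R by splitting the data by tree. This is an alternative sketch, not the approach used above; the names by_tree and chart are introduced here for illustration.

```r
orange <- as.data.frame(Orange)

# Split into one data frame per tree; as.character() orders the trees "1" to "5".
by_tree <- split(orange, as.character(orange$Tree))

# One covariance and one correlation per tree, assembled with the requested titles.
chart <- data.frame(
  TREE = names(by_tree),
  COVARIANCE = unname(sapply(by_tree, function(d) cov(d$age, d$circumference))),
  CORRELATION = unname(sapply(by_tree, function(d) cor(d$age, d$circumference)))
)
print(chart)
```

#Splitting on the tree label avoids relying on the internal ordering of the Tree factor, whose levels are not stored in numeric order.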
#4(d) How do the values in part c) change if you add 10 to all the circumference values?
covplus <-data.frame(Tree=c(1:max(orange$Tree)),covplus=c(1))
corplus <-data.frame(Tree=c(1:max(orange$Tree)),corplus=c(1))
for(i in 1:nrow(correlation)){
covplus[i,2]<-cov(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i]+10)
corplus[i,2]<-cor(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i]+10)
}
covcorplus<-merge(covacorr,covplus,"Tree")
covcorplus<-merge(covcorplus,corplus,"Tree")
print(covcorplus)
## Tree covarience correlation covplus corplus
## 1 1 22340.07 0.9854675 22340.07 0.9854675
## 2 2 34290.45 0.9873624 34290.45 0.9873624
## 3 3 22239.83 0.9881766 22239.83 0.9881766
## 4 4 37062.62 0.9844610 37062.62 0.9844610
## 5 5 30442.81 0.9877376 30442.81 0.9877376
#4(e) How does the value in part c) change if you multiply all the circumference values by 2?
covmulti<-data.frame(Tree=c(1:max(orange$Tree)),covmulti=c(1))
cormulti<-data.frame(Tree=c(1:max(orange$Tree)),cormulti=c(1))
for(i in 1:nrow(correlation)){
covmulti[i,2]<-cov(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i]*2)
cormulti[i,2]<-cor(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i]*2)
}
covcormulti<-merge(covacorr,covmulti,"Tree")
covcormulti<-merge(covcormulti,cormulti,"Tree")
print(covcormulti)
## Tree covarience correlation covmulti cormulti
## 1 1 22340.07 0.9854675 44680.14 0.9854675
## 2 2 34290.45 0.9873624 68580.90 0.9873624
## 3 3 22239.83 0.9881766 44479.67 0.9881766
## 4 4 37062.62 0.9844610 74125.24 0.9844610
## 5 5 30442.81 0.9877376 60885.62 0.9877376
#4(f) How does the value in part c) change if you multiply all the circumference values by -2?
covmultiminus<-data.frame(Tree=c(1:max(orange$Tree)),covmultiminus=c(1))
cormultiminus<-data.frame(Tree=c(1:max(orange$Tree)),cormultiminus=c(1))
for(i in 1:nrow(correlation)){
covmultiminus[i,2]<-cov(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i]*-2)
cormultiminus[i,2]<-cor(orange$age[orange$Tree==i],orange$circumference[orange$Tree==i]*-2)
}
covcormultiminus<-merge(covacorr,covmultiminus,"Tree")
covcormultiminus<-merge(covcormultiminus,cormultiminus,"Tree")
print(covcormultiminus)
## Tree covarience correlation covmultiminus cormultiminus
## 1 1 22340.07 0.9854675 -44680.14 -0.9854675
## 2 2 34290.45 0.9873624 -68580.90 -0.9873624
## 3 3 22239.83 0.9881766 -44479.67 -0.9881766
## 4 4 37062.62 0.9844610 -74125.24 -0.9844610
## 5 5 30442.81 0.9877376 -60885.62 -0.9877376
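#Parts d) to f) follow from two general facts: cov(x, a*y + b) = a*cov(x, y), and cor(x, a*y + b) = sign(a)*cor(x, y). The sketch below checks them on the first tree only; the names tree1 and results are introduced here for illustration.

```r
orange <- as.data.frame(Orange)
tree1 <- orange[orange$Tree == 1, ]
x <- tree1$age
y <- tree1$circumference

# Covariance and correlation of age with each transformed circumference.
results <- data.frame(
  transform = c("y", "y + 10", "2 * y", "-2 * y"),
  covariance = c(cov(x, y), cov(x, y + 10), cov(x, 2 * y), cov(x, -2 * y)),
  correlation = c(cor(x, y), cor(x, y + 10), cor(x, 2 * y), cor(x, -2 * y))
)
print(results)
```

#Adding a constant changes neither value. Multiplying by 2 doubles the covariance and leaves the correlation unchanged. Multiplying by -2 doubles the covariance and flips the sign of both.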
#5) This question uses the sample of 5,000 Desert Houses from problem three. a) What is the median value? Is it larger or smaller than the mean?
#5(a)
setwd("C:/Users/Manjari/Desktop/Machine learning/Home Work Solutions")
desert <- read.csv("HW01pb3Desertdata.csv", header = FALSE)
names(desert)[1] <- "HousePrice"
mean(desert$HousePrice)
## [1] 144.0348
median(desert$HousePrice)
## [1] 89
#b) What does your answer to part a) suggest about the shape of the distribution (right-skewed or left-skewed)? Does the distribution have more weight at one end? Is there a longer tail at the other?
#The mean (144.0348) is greater than the median (89), so the distribution is skewed to the right: most of the weight is at the low end, and a long tail of a few high prices pulls the mean above the median.
#5(b)
hist(desert$HousePrice, breaks = 20)
mean(desert$HousePrice)
## [1] 144.0348
median(desert$HousePrice)
## [1] 89
#5(c) How does the median change if you add 10 (thousand dollars) to all the values?
desert <- read.csv("HW01pb3Desertdata.csv", header = FALSE)
names(desert)[1] <- "HousePrice"
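mean_data<-mean(desert$HousePrice)
median_data<-median(desert$HousePrice)
#The prices are in thousands of dollars, so adding 10 thousand dollars means adding 10
mean_dataplus<-mean_data+10
median_dataplus<-median_data+10
#Adding 10 to every value shifts the median by 10, from 89 to 99
#5(d) How does the median change if you multiply all the values by 2?
mean_dataMulti <-mean_data * 2
median_dataMulti <-median_data * 2
#Multiplying every value by 2 doubles the median, from 89 to 178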
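#The shortcuts above rely on the median moving with the data: median(x + 10) = median(x) + 10 and median(2 * x) = 2 * median(x). A minimal check, assuming a small hypothetical right-skewed prices vector (in thousands of dollars) instead of the desert file:

```r
# Hypothetical right-skewed prices in thousands of dollars.
prices <- c(40, 60, 89, 120, 900)

print(median(prices))       # middle value of the sorted data
print(median(prices + 10))  # shifting every value shifts the median
print(median(prices * 2))   # scaling every value scales the median

# One large value pulls the mean above the median (right skew).
print(mean(prices) > median(prices))
```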