Project 1

Problem 1

Use the iris dataset to compute a 6-number summary (min, max, Q1, Q3, median, mean) for Sepal.Width for each type of iris (setosa, versicolor, virginica).

by(iris$Sepal.Width,iris$Species,summary) #computes 6-number summary

## iris$Species: setosa
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.300   3.200   3.400   3.428   3.675   4.400 
## ------------------------------------------------------------ 
## iris$Species: versicolor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.525   2.800   2.770   3.000   3.400 
## ------------------------------------------------------------ 
## iris$Species: virginica
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.200   2.800   3.000   2.974   3.175   3.800
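
The same six statistics can also be collected into one table, which makes the three species easier to compare. This is only a sketch of an alternative layout; the name summary_table is our own choice and is not part of the assignment.

```r
# Split Sepal.Width by species, summarize each piece,
# then transpose so each species is a row and each statistic a column
summary_table <- t(sapply(split(iris$Sepal.Width, iris$Species), summary))
round(summary_table, 3)
```

Each row of the table should match the corresponding block of the by() output.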

Problem 2

In the mtcars dataset, the number of cylinders (cyl) is numeric. Create a new variable cyl2 that is a factor variable and use it to make a boxplot of mpg categorized by the number of cylinders labeling the axes as “Number of Cylinders” and “Miles per Gallon”.

newmtcars <- mtcars # copy mtcars so the original data set stays unchanged
newmtcars$cyl2 <- as.factor(newmtcars$cyl) # cyl2 is cyl stored as a factor variable
boxplot(mpg ~ cyl2, data = newmtcars, xlab = "Number of Cylinders", ylab = "Miles per Gallon")

is.factor(newmtcars$cyl2) # check that cyl2 is a factor

## [1] TRUE

Problem 3

Take a random sample of size n = 75 from a uniform distribution on [0, 6] and compute the sample mean. Repeat this process 999 more times to produce a total of 1000 sample means. Compute the mean and standard deviation of the 1000 sample means. Are your results consistent with the central limit theorem? Explain.

mean(runif(75,min=0,max=6)) #computes the mean of one random sample of size 75

## [1] 2.876799

data3 <- replicate(1000,mean(runif(75,min=0,max=6))) #creates 1000 sample means
hist(data3,main = "Histogram of Means",xlab = "Means",ylab= "Frequency") #creates histogram

mean(data3) #computes mean of 1000 sample means

## [1] 3.008262

sd(data3) #computes standard deviation of 1000 sample means

## [1] 0.2015861

We know that a uniform distribution on [a, b] has a mean of \(\frac{a+b}{2}\) and a standard deviation of \(\sqrt{\frac{(b-a)^2}{12}}\). In this problem we have a uniform distribution on [0, 6], which has a mean of \(\frac{0+6}{2} = 3\) and a standard deviation of \(\sqrt{\frac{(6-0)^2}{12}} \approx 1.732\).

By the Central Limit Theorem, the means of samples of size n = 75 should be approximately normally distributed, with a mean very close to the population mean of 3 and a standard deviation very close to the population standard deviation divided by \(\sqrt{n}\) (i.e. \(\frac{1.732}{\sqrt{75}}\approx 0.2\)). Our 1000 sample means have a mean of 3.008 and a standard deviation of 0.2016, and the histogram shows them to be approximately normal, so the results are consistent with the Central Limit Theorem.
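
The comparison between the simulated and theoretical values can be sketched in a few lines. The seed and the variable names below are our own assumptions, added only so that the run is reproducible; a different seed gives slightly different simulated values.

```r
set.seed(315) # arbitrary seed (an assumption) so reruns give the same numbers

a <- 0
b <- 6
n <- 75

# 1000 sample means, each from a sample of size n drawn from Uniform(a, b)
sample_means <- replicate(1000, mean(runif(n, min = a, max = b)))

# Values predicted by the Central Limit Theorem
theory_mean <- (a + b) / 2
theory_sd <- sqrt((b - a)^2 / 12) / sqrt(n)

c(simulated = mean(sample_means), theoretical = theory_mean)
c(simulated = sd(sample_means), theoretical = theory_sd)
```

The simulated values should land close to the theoretical mean of 3 and standard deviation of 0.2.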

Problem 4

Use read.csv() to import the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/geese.txt into R. Make histograms of the logarithms of Aestimate and Bestimate and place them side-by-side in the same plot and label each using main.

geese <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\geese.txt")#read in file

loga <- log(geese$Aestimate)#logarithms of Aestimate and Bestimate
logb <- log(geese$Bestimate)

par(mfrow = c(1,2))
hist(loga,main="log of Aestimate", xlab= "Log of Aestimate")#make histograms of each
hist(logb,main= "log of Bestimate", xlab="Log of Bestimate")

Problem 5

Use read.csv() with header=FALSE to import the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA251/Data/CerealSugar1979.txt and T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA251/Data/CerealSugar2006.txt. Use names() to rename the columns Cereal and SugarContent, respectively. Then do a two-sample t-test to determine if the mean sugar content of cereals has changed from 1979 to 2006.

cerealsugar1979 <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\CerealSugar1979.csv", header = FALSE)#read in both files
cerealsugar2006 <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\CerealSugar2006.csv", header = FALSE)

names(cerealsugar1979) <- c("Cereal", "SugarContent")#rename columns
names(cerealsugar2006) <- c("Cereal", "SugarContent")
Sugar1979 <- cerealsugar1979$SugarContent #sugar content column from each file
Sugar2006 <- cerealsugar2006$SugarContent
t.test(Sugar1979, Sugar2006, var.equal=TRUE, alternative="two.sided", mu=0, paired=FALSE, conf.level =.95)#two sample t test

## 
##  Two Sample t-test
## 
## data:  Sugar1979 and Sugar2006
## t = -0.40946, df = 109, p-value = 0.683
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.013194  4.611614
## sample estimates:
## mean of x mean of y 
##  26.76452  27.96531

Hypotheses

\(H_{0}\): The mean sugar content of popular cereals did not change from 1979 to 2006.

\(H_{a}\): The mean sugar content of popular cereals changed from 1979 to 2006.

\(\alpha\) = 0.05

The two-sample t-test gives a p-value of 0.683, which is greater than \(\alpha\) = 0.05, so we fail to reject the null hypothesis. There is not sufficient evidence that the mean sugar content of popular cereals changed from 1979 to 2006.
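
As a check on the t-test output, the p-value and the confidence interval can be rebuilt from the reported t statistic, degrees of freedom, and sample means. This is a sketch that uses only the numbers printed by t.test(); the variable names are our own.

```r
# Values copied from the t.test() output
t_stat <- -0.40946
df <- 109
mean_1979 <- 26.76452
mean_2006 <- 27.96531

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Standard error recovered from t = (difference in means) / SE
diff_means <- mean_1979 - mean_2006
se <- diff_means / t_stat

# 95% confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se

round(p_value, 3)
round(ci, 3)
```

The interval contains 0, which agrees with failing to reject the null hypothesis at the 0.05 level.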

Problem 6

Use read.table() to import the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA252/Data/PCText/Chapter 11/retail.txt. Build a linear model to predict Gross-Sales using Gross-Cash, Cash-Items, and Gross-Check as predictors. Interpret the R output to type the resulting equation for the model.

retail <- read.table("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\retail.txt", header=TRUE)#read in file

linearmodel <- lm(Gross.Sales~Gross.Cash + Cash.Items + Gross.Check, data = retail)#build linear model

summary(linearmodel)#show summary of linear model made

## 
## Call:
## lm(formula = Gross.Sales ~ Gross.Cash + Cash.Items + Gross.Check, 
##     data = retail)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85.96 -29.67  -5.64  29.02  93.42 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.92979   22.40870  -0.577   0.5701    
## Gross.Cash    0.08669    0.46887   0.185   0.8551    
## Cash.Items    7.62916    2.86987   2.658   0.0147 *  
## Gross.Check   1.16480    0.13066   8.915 1.39e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.68 on 21 degrees of freedom
## Multiple R-squared:  0.9307, Adjusted R-squared:  0.9208 
## F-statistic: 94.02 on 3 and 21 DF,  p-value: 2.462e-12

Equation: gross sales = -12.92979 + 0.08669(gross cash) + 7.62916(cash items) + 1.16480(gross check)
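
To show how the equation is used, the sketch below wraps the fitted coefficients in a small function. The function name and the input values are hypothetical and are not taken from the retail data.

```r
# Coefficients copied from the summary(linearmodel) output
predict_gross_sales <- function(gross_cash, cash_items, gross_check) {
  -12.92979 + 0.08669 * gross_cash + 7.62916 * cash_items + 1.16480 * gross_check
}

# Hypothetical inputs, chosen only for illustration
estimate <- predict_gross_sales(gross_cash = 100, cash_items = 20, gross_check = 300)
estimate
```

With the fitted model object available, predict(linearmodel, newdata = ...) should give the same kind of estimate without retyping the coefficients.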

Problem 7

Kudzu is a plant that was imported to the United States from Japan and now covers over seven million acres in the South. The plant contains chemicals called isoflavones that have been shown to have beneficial effects on bones. One study at Purdue University used three groups of rats to compare a control group with rats that were fed either a low dose or a high dose of isoflavones from kudzu. One of the outcomes examined was the bone mineral density in the femur (in grams per square centimeter). Use the kudzu data file to do a one-way ANOVA and interpret the result.

kudzu <- read.delim("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\data.kudzu.txt", header = TRUE) #reading the kudzu text file

Hypotheses

\(H_{0}\): The mean BMD is the same for all three treatment groups of rats.

\(H_{a}\): The mean BMD is not the same for all three treatment groups of rats.

\(\alpha\) = 0.05

The largest group standard deviation is less than twice the smallest, so the equal-variance assumption is met and we can run a one-way ANOVA:

by(kudzu$BMD, kudzu$Treatment, sd) #Standard deviation of each group

## kudzu$Treatment: Control
## [1] 0.01158735
## ------------------------------------------------------------ 
## kudzu$Treatment: HighDose
## [1] 0.01877105
## ------------------------------------------------------------ 
## kudzu$Treatment: LowDose
## [1] 0.01151066

results <-aov(kudzu$BMD ~ kudzu$Treatment, data=kudzu) #running one-way anova
summary(results)

##                 Df   Sum Sq   Mean Sq F value Pr(>F)   
## kudzu$Treatment  2 0.003186 0.0015928   7.718 0.0014 **
## Residuals       42 0.008668 0.0002064                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.0014, which is less than \(\alpha\) = 0.05, so we reject the null hypothesis. We conclude that there is significant evidence that the mean BMD differs among the treatment groups, i.e. that the treatment level has an effect on the BMD.
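
The ANOVA table and the standard deviation check can both be verified from the printed numbers alone. The sketch below uses only values copied from the output; the variable names are our own.

```r
# Mean squares and degrees of freedom copied from the summary(results) table
ms_treatment <- 0.0015928
ms_residual <- 0.0002064

# F statistic is the ratio of the two mean squares
f_stat <- ms_treatment / ms_residual

# Upper-tail probability of the F distribution with 2 and 42 degrees of freedom
p_value <- pf(f_stat, df1 = 2, df2 = 42, lower.tail = FALSE)

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- 0.01877105 / 0.01151066

round(f_stat, 3)
round(p_value, 4)
round(sd_ratio, 2)
```

Small differences from the printed F value come from rounding in the mean squares.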

Problem 8

Download the .csv file from https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv and use R to determine the number of observations for which VAL = 24.

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv","C:\\Users\\Sarah\\OneDrive\\DAT315\\data9.txt") #saves the file locally
file.info("C:\\Users\\Sarah\\OneDrive\\DAT315\\data9.txt")$ctime #Finds date and time file was downloaded

## [1] "2020-02-04 16:48:17 EST"

downloadeddata <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\data9.txt", header=FALSE)
sum(downloadeddata$V37==24) #Column 37 (named V37 because header=FALSE) holds VAL; counts the observations for which VAL=24

## [1] 53
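
If the file is read with header=TRUE instead, the column can be referred to by name, but missing values then have to be handled explicitly. The sketch below uses a tiny made-up CSV (not the real survey data) to show why na.rm = TRUE is needed in that case.

```r
# Toy data, invented for illustration: the second row has a blank VAL
toy <- read.csv(text = "SERIALNO,VAL\n1,24\n2,\n3,12\n4,24", header = TRUE)

# The blank field is read as NA, so the plain sum is NA
count_with_na <- sum(toy$VAL == 24)

# Dropping NA comparisons gives the number of rows where VAL is 24
count_val_24 <- sum(toy$VAL == 24, na.rm = TRUE)

count_with_na
count_val_24
```

On the real file the equivalent call would be sum(downloadeddata$VAL == 24, na.rm = TRUE), assuming the data were read with header=TRUE.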

Problem 9

Use the data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA463/MONDIAL.accdb to produce a histogram for the logarithm of the population density of every country (log(number of people/unit area)).

MondialCountry <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\Country.csv", header=TRUE)
Population <- MondialCountry$Population
Area <- MondialCountry$Area
LogofPopData <- log10(Population/Area) #computes the base-10 logarithm of the population density of every country
hist(LogofPopData, main= "Frequency Distribution of Log Population Density", xlab= "log10(Population Density)")
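
Because log10() is used here rather than the natural logarithm log(), a value of 2 on the horizontal axis corresponds to a density of 100 people per unit area. The two logarithms differ only by a constant factor, so the shape of the histogram is the same either way. A short sketch with made-up densities (not taken from the MONDIAL data):

```r
# Hypothetical population densities, for illustration only
density <- c(2.5, 80, 1200)

# log10(x) equals log(x) / log(10), so the two scales differ by a constant
same_up_to_constant <- isTRUE(all.equal(log10(density), log(density) / log(10)))
same_up_to_constant

# Converting a log10 value back to the original density scale
10^2
```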

Problem 10

Use map() from the maps package to make a map of the world. Then use points() to add the cities in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/MA463/MONDIAL.accdb to the map. Your graph should look something like the following.

City1 <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 1\\City1.csv", header=TRUE)#read in file
suppressWarnings(library(maps)) #suppresses the warning about updating R to 3.6.2
map(database = "world", fill=TRUE, col="white")
points(City1$Longitude,City1$Latitude, col="red") #plots points