This is a programming assignment.
Part A (40 marks) The file household.csv contains (fictional) data from a survey of 500 randomly selected households.
## [1] "integer " "integer " "integer " "integer " "integer " "integer "
## [7] "integer " "integer " "integer "
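A class listing like the one above can be produced along these lines. This is only a minimal sketch: `hhs_demo` is a made-up stand-in for household.csv, using column names that appear elsewhere in this report, and its values are illustrative.

```r
# Toy stand-in for household.csv (values are made up for illustration)
hhs_demo <- data.frame(
  Family.Size = c(2L, 4L, 1L),
  Location    = c(1L, 3L, 4L),
  Ownership   = c(1L, 0L, 1L)
)

# Storage class of every column. Integer storage says nothing about whether
# a variable is nominal, ordinal, or numeric in meaning.
col_classes <- sapply(hhs_demo, class)
print(col_classes)
```

The storage class only tells how R holds the values, so the measurement level of each variable still has to be judged from its meaning.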
Nominal: Location, where the values 1 to 4 are only labels with no size or order meaning and simply identify places; Ownership, which shows whether the house is owned by the family (1 means the family owns the house, 0 means it does not).
Ordinal: Family.Size, which shows how many people are in a family. It ranges from 1 to 10, where 1 is the smallest family and 10 is the largest, so its values can be ordered, indicating it is an ordinal variable.
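One way to make the nominal coding explicit in R is to store such columns as factors. The sketch below assumes the 0/1 and 1 to 4 codings described above; `hhs_demo` and the labels "Not owned" and "Owned" are illustrative choices, not part of the survey file.

```r
# Toy stand-in for the two nominal columns (values are made up)
hhs_demo <- data.frame(
  Location  = c(1L, 3L, 4L, 2L, 1L),
  Ownership = c(1L, 0L, 1L, 1L, 0L)
)

# Treat the codes as category labels rather than quantities
hhs_demo$Location  <- factor(hhs_demo$Location)
hhs_demo$Ownership <- factor(hhs_demo$Ownership,
                             levels = c(0, 1),
                             labels = c("Not owned", "Owned"))

# Counts per category are the natural summary for nominal data
location_counts  <- table(hhs_demo$Location)
ownership_counts <- table(hhs_demo$Ownership)
print(ownership_counts)
```

Once a column is a factor, functions such as `summary()` and `barplot(table(...))` report counts per category instead of means.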
Histograms for Family Size, First Income, Second Income, and Monthly Payment are skewed to the right.
Histogram for Utilities is neither symmetric nor skewed.
Histogram for Debt is approximately symmetric.
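A numeric check can complement the visual reading of the histograms. The sketch below defines a hypothetical helper, `sample_skewness`, using the moment-based definition of skewness, and applies it to simulated data rather than to the survey variables.

```r
# Moment-based sample skewness: positive for a long right tail,
# near zero for a roughly symmetric sample
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
right_skewed <- rexp(500)   # simulated stand-in for a right-skewed variable
symmetric    <- rnorm(500)  # simulated stand-in for a symmetric variable

skew_right <- sample_skewness(right_skewed)
skew_sym   <- sample_skewness(symmetric)
print(c(right_skewed = skew_right, symmetric = skew_sym))
```

Applied to a column such as `hhs$First.Income`, a clearly positive value would support the right-skew reading.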
hist(hhs$Family.Size, main="Histogram for Family Size",
xlab="Family Size", xlim=c(1,6))
hist(hhs$First.Income, main = "Histogram for First Income", xlab = "First Income")
hist(hhs$Second.Income, main = "Histogram for Second Income", xlab = "Second Income" )
hist(hhs$Monthly.Payment, main = "Histogram for Monthly Payment", xlab = "Monthly Payment" )
hist(hhs$Utilities, main = "Histogram for Utilities", xlab = "Utilities" )
hist(hhs$Debt, main = "Histogram for Debt", xlab = "Debt" )
## [1] 8877
## [1] "2948.5 4267.5 5675.5"
## [1] 2727
The histograms for Family Size, First Income, Second Income, and Monthly Payment are skewed to the right, showing that there are some high outliers for these numeric variables: a few families are large or have a high first income, second income, or monthly payment. Most families in the dataset have a family size of 1 to 4, a first income of 30000-50000 dollars, a second income of 20000-40000 dollars, and a monthly payment of 600-1000 dollars.
However, the histogram for Debt is approximately symmetric, suggesting that although income and monthly payment levels differ considerably, families with high income do not necessarily carry more debt. The 25th, 50th, and 75th percentiles show that the middle half of the debt values falls between about $2900 and $5700. The histogram for Utilities is neither symmetric nor skewed in one direction. It has two peaks, one around 200-210 dollars and the other around 250-260 dollars, suggesting that families can be divided into two groups in terms of utilities. This variable might need further treatment, such as a transformation, before regression analysis.
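The quartile summary and the two-peak reading can be sketched as follows. The data here are simulated stand-ins (`debt_demo`, `utilities_demo`) with parameters chosen only for illustration; they are not the survey values.

```r
set.seed(2)

# Simulated, roughly symmetric stand-in for Debt
debt_demo <- round(rnorm(500, mean = 4300, sd = 1900))
debt_quartiles <- quantile(debt_demo, probs = c(0.25, 0.50, 0.75))
print(debt_quartiles)

# Simulated two-group stand-in for Utilities
utilities_demo <- c(rnorm(250, mean = 205, sd = 8),
                    rnorm(250, mean = 255, sd = 8))

# A kernel density estimate makes multiple peaks easier to see
# than a histogram with coarse bins
plot(density(utilities_demo), main = "Density for simulated Utilities")
```

The interval between the 25th and 75th percentiles covers the middle half of the observations, which is the sense in which the values "mainly fall" in that range.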
Part B (40 marks) The file SupermarketTransactions.csv contains data on over 14,000 transactions. There are two numerical variables, Units Sold and Revenue. The first of these is discrete and the second is continuous. For each of the following, do whatever it takes to create a bar chart of counts for Units Sold and a histogram of Revenue for the given subpopulation of purchases.
smt <- read.csv("SupermarketTransactions.csv", header = TRUE)
purchase_date <- as.Date(smt$Purchase.Date, "%m/%d/%Y")
smtJF <- subset(smt,
                purchase_date >= as.Date("2008-01-01") &
                  purchase_date <= as.Date("2008-02-29"),
                select = c(Units.Sold, Revenue))
barplot(summary(factor(smtJF$Units.Sold)), xlab = "Units.Sold", ylab="Frequency", main = "Units.Sold during January and February of 2008")
hist(smtJF$Revenue, xlab = "Revenue", ylab = "Frequency", main = "Revenue during January and February of 2008")
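# Assumes the file has a Homeowner column coded "Y"/"N"; adjust if it is named or coded differently
smtMF <- subset(smt, Gender == "F" & Marital.Status == "M" & Homeowner == "Y", select = c(Units.Sold, Revenue))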
barplot(summary(factor(smtMF$Units.Sold)),xlab = "Units.Sold", ylab="Frequency", main = "Units.Sold by married female homeowners")
hist(smtMF$Revenue, xlab = "Revenue", ylab = "Frequency", main = "Revenue by married female homeowners")
smtSC <- subset(smt, State.or.Province == "CA", select = c(Units.Sold, Revenue))
barplot(summary(factor(smtSC$Units.Sold)),xlab = "Units.Sold", ylab="Frequency", main = "Units.Sold made in the state of California")
hist(smtSC$Revenue, xlab = "Revenue", ylab = "Frequency", main = "Revenue made in the state of California")
smtPD <- subset(smt, Product.Department == "Produce", select = c(Units.Sold, Revenue))
barplot(summary(factor(smtPD$Units.Sold)),xlab = "Units.Sold", ylab="Frequency", main = "Units.Sold made in the Produce product department")
hist(smtPD$Revenue, xlab = "Revenue", ylab = "Frequency", main = "Revenue made in the Produce product department")
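Since every subpopulation gets the same two plots, the repeated calls could be wrapped in a small helper. This is only a sketch: `plot_segment` and `smt_demo` are hypothetical names, and the toy rows stand in for the real transactions.

```r
# Draw the bar chart of Units.Sold counts and the Revenue histogram
# for one subset; returns the counts invisibly for inspection
plot_segment <- function(data, label) {
  counts <- table(data$Units.Sold)
  barplot(counts, xlab = "Units.Sold", ylab = "Frequency",
          main = paste("Units.Sold", label))
  hist(data$Revenue, xlab = "Revenue", ylab = "Frequency",
       main = paste("Revenue", label))
  invisible(counts)
}

# Toy stand-in for SupermarketTransactions.csv (values are made up)
smt_demo <- data.frame(
  Units.Sold        = c(3, 4, 4, 5, 4),
  Revenue           = c(8.5, 12.1, 9.9, 15.3, 11.0),
  State.or.Province = c("CA", "CA", "WA", "CA", "OR")
)

ca_counts <- plot_segment(subset(smt_demo, State.or.Province == "CA"),
                          "made in the state of California")
print(ca_counts)
```

Using `table()` for the counts gives the same bars as `summary(factor(...))` and keeps the labelling consistent across segments.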
Write a report that is less than 250 words that summarizes your analysis.
The revenue distributions of the supermarket transactions are all right-skewed: most transactions have a revenue of around $10, so small transactions account for a large proportion of revenue. This also indicates that, if further analysis is required, the data might need a non-linear transformation to show the relationship between the variable and the response.
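A log transformation is one common non-linear transformation for right-skewed, positive data. The sketch below uses simulated revenue (`revenue_demo`, an assumption for illustration) and compares the gap between mean and median before and after the transformation.

```r
set.seed(3)

# Simulated right-skewed revenue centred near $10 (not the real data)
revenue_demo <- rlnorm(1000, meanlog = log(10), sdlog = 0.6)
log_revenue  <- log(revenue_demo)

# A mean well above the median signals right skew;
# after taking logs the two nearly coincide
gap_raw <- mean(revenue_demo) - median(revenue_demo)
gap_log <- mean(log_revenue) - median(log_revenue)
print(c(raw = gap_raw, log = gap_log))

hist(log_revenue, xlab = "log(Revenue)", ylab = "Frequency",
     main = "Simulated revenue after log transformation")
```

Whether a log is the right choice for the real Revenue column would still need to be checked against the transformed histogram.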
The distributions of Units.Sold are quite symmetric, and most transactions have a Units.Sold value of 3 to 5. Whichever subset we use, the distribution of Units.Sold shows a similar pattern, indicating that Units.Sold per transaction hardly varies from one segment to another.
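The claim that the pattern hardly changes across segments can be checked by comparing proportions rather than raw counts, since the segments differ in size. A minimal sketch on made-up rows (`smt_demo` is a stand-in, not the real file):

```r
# Toy stand-in with one segmenting variable (values are made up)
smt_demo <- data.frame(
  Units.Sold = c(3, 4, 5, 4, 3, 4, 5, 4),
  Gender     = c("F", "F", "F", "F", "M", "M", "M", "M")
)

# Row-wise proportions: each row shows the Units.Sold distribution
# within one segment, so rows are directly comparable
units_by_segment <- prop.table(table(smt_demo$Gender, smt_demo$Units.Sold),
                               margin = 1)
print(units_by_segment)
```

Rows that look alike support the conclusion; a formal comparison would need a test such as `chisq.test()` on the underlying counts.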
Part C All of you must have heard about the central limit theorem (CLT).
# Draw n normal values, then rescale so the sample mean and sd equal `mean` and `sd` exactly
rnorm2 <- function(n, mean, sd) {
  mean + sd * scale(rnorm(n))
}
set.seed(1239)
r1 <- rnorm2(100,25,4)
r2 <- rnorm2(50,10,3)
samplingframe <- c(r1,r2)
hist(samplingframe, breaks=20,col = "pink")
Please describe the distribution that you obtain.
set.seed(1239)
pr1 <- matrix(0, nrow = 15, ncol = 50)
for (i in 1:50) {
  # each column holds one sample of size 15 drawn with replacement
  pr1[, i] <- sample(samplingframe, size = 15, replace = TRUE)
}
prmeans1 <- apply(pr1, 2, mean)
prmeans1
## [1] 21.60520 20.37256 25.22757 20.62299 21.23132 20.63834 15.36655
## [8] 18.08487 19.03047 17.33072 19.89736 25.02741 22.67489 19.26415
## [15] 20.69212 23.70095 20.08315 18.01933 19.92811 21.26000 16.08441
## [22] 18.78280 14.59421 21.69748 21.91940 19.07384 17.26868 19.82363
## [29] 19.77218 17.94577 19.97383 22.51172 21.11355 25.01786 18.05207
## [36] 17.74380 19.57222 23.59518 19.95549 18.41675 17.84621 21.15255
## [43] 19.95297 22.24467 19.35886 16.59597 17.91641 20.26703 18.33284
## [50] 20.53251
hist(prmeans1, main = "50 means of size 15", breaks=20, col = "grey")
set.seed(1239)
pr2 <- matrix(0, nrow = 45, ncol = 50)
for (i in 1:50) {
  # each column holds one sample of size 45 drawn with replacement
  pr2[, i] <- sample(samplingframe, size = 45, replace = TRUE)
}
prmeans2 <- apply(pr2, 2, mean)
prmeans2
## [1] 22.40177 20.83088 17.49397 20.75183 20.87705 20.60115 19.09084
## [8] 18.35816 19.42064 19.18053 21.19970 20.27124 21.04097 19.13851
## [15] 20.51883 18.25980 20.38829 20.02673 19.26421 17.92795 20.66285
## [22] 19.06719 20.49026 22.17356 20.41382 18.57478 16.98640 22.44220
## [29] 19.98459 18.85672 19.32652 21.65991 18.43538 21.45618 20.45493
## [36] 20.92773 19.31540 19.28337 20.64921 18.72511 19.95125 20.02142
## [43] 18.63121 17.95892 20.37356 20.57190 20.69279 20.25171 22.09495
## [50] 21.05096
hist(prmeans2, main = "50 means of size 45", breaks=20, col = "black")
hist(prmeans1, breaks = 20, col = "grey", main = "combination of parts b and c", xlab = "means of parts b and c")
hist(prmeans2, breaks=20, col = "black", add = T)
Explain the three histograms in terms of their differences and similarities.
paste0("samplemean = ", round(mean(samplingframe), 2), " samplemean1 = ", round(mean(prmeans1),2), " samplemean2 = ", round(mean(prmeans2),2) )
## [1] "samplemean = 20 samplemean1 = 19.94 samplemean2 = 19.97"
paste0("samplesd = ", round(sd(samplingframe),2), " samplesd1 = ", round(sd(prmeans1),2), " samplesd2 = ", round(sd(prmeans2),2))
## [1] "samplesd = 8 samplesd1 = 2.35 samplesd2 = 1.28"
Similarity: all three distributions have nearly the same mean (about 20).
Differences: 1. standard deviation; 2. shape of the distribution.
From the graphs above, the distributions in parts a, b, and c share a similar mean, but their standard deviations show that the spread of the three distributions is quite different (a > b > c). As the sample size increases, the standard deviation of the sample means becomes smaller and is roughly sd/sqrt(n). Also, comparing the two new distributions, c tends to be closer to a normal distribution than b.
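The sd/sqrt(n) relation can be checked against the numbers printed above. The sketch below hard-codes the rounded values from that output (sd of the sampling frame 8, sd of the means 2.35 and 1.28) rather than recomputing them.

```r
frame_sd     <- 8              # sd of the sampling frame, from the output above
sample_sizes <- c(15, 45)
observed_sd  <- c(2.35, 1.28)  # sd of the 50 sample means, from the output above

# Standard error predicted by sd / sqrt(n)
theoretical_se <- frame_sd / sqrt(sample_sizes)

comparison <- data.frame(
  n           = sample_sizes,
  theoretical = round(theoretical_se, 2),
  observed    = observed_sd
)
print(comparison)
# theoretical column: 2.07 for n = 15 and 1.19 for n = 45
```

The observed values are in the same range as the predicted ones; some gap is to be expected because each estimate rests on only 50 sample means.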
Given any distribution with a finite standard deviation, if we repeatedly draw samples from it and take their means, the distribution of those means gets closer to a normal distribution as the sample size increases. The mean of the sample means equals the mean of the original distribution, and their standard deviation is sd/sqrt(n). Taking more samples does not change this shape; it only shows it more clearly.
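In standard notation, for independent draws $X_1, \dots, X_n$ from a distribution with mean $\mu$ and finite standard deviation $\sigma$ (symbols introduced here for illustration), the statement reads:

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\mathbb{E}[\bar{X}_n] = \mu, \qquad
\operatorname{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}, \qquad
\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)
\quad (n \to \infty)
```

The first two properties hold for every sample size; only the normal shape is a large-sample approximation.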
Yes. Part a gives a non-normal distribution, and we are asked to sample from it to create two new distributions of sample means. As the sample size increases, the distribution of the means becomes closer to normal and its spread becomes smaller, which shows the importance of the CLT in statistical analysis.
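With only 50 means per histogram the pattern is noisy, so a larger simulation can make it clearer. The sketch below is an assumption-laden variant of the exercise: `frame_demo` is a similar two-group frame built with plain `rnorm` (not the exact frame above), the helper `mean_sd_for_size` is hypothetical, and 2000 replications and the extra size 135 are arbitrary choices.

```r
set.seed(42)

# Two-group stand-in for the sampling frame (not the exact frame above)
frame_demo <- c(rnorm(100, mean = 25, sd = 4), rnorm(50, mean = 10, sd = 3))

# sd of `reps` sample means, each from a sample of size n drawn with replacement
mean_sd_for_size <- function(n, reps = 2000) {
  means <- replicate(reps, mean(sample(frame_demo, size = n, replace = TRUE)))
  sd(means)
}

sizes       <- c(15, 45, 135)
empirical   <- sapply(sizes, mean_sd_for_size)
theoretical <- sd(frame_demo) / sqrt(sizes)

result <- rbind(empirical, theoretical)
colnames(result) <- paste0("n=", sizes)
print(round(result, 2))
```

Each tripling of the sample size should shrink the spread of the means by a factor of about sqrt(3), in line with sd/sqrt(n).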