Using the mtcars data set in R, please answer the following questions.
# Loading the data
data(mtcars)
# Head of the data set
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
discrete_vars <- sapply(mtcars, function(x) is.factor(x) || is.character(x) || length(unique(x)) < 10)
continuous_vars <- sapply(mtcars, function(x) is.numeric(x) && length(unique(x)) >= 10)
num_discrete <- sum(discrete_vars)
num_continuous <- sum(continuous_vars)
print(paste("There are", num_discrete, "discrete variables and", num_continuous, "continuous variables in the dataset."))
## [1] "There are 5 discrete variables and 6 continuous variables in the dataset."
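To see which columns fall on each side of the cutoff, a minimal sketch could list their names. The names `n_unique`, `discrete_names`, and `continuous_names` are introduced here only for illustration, and the threshold of 10 unique values is the same heuristic used above, not a formal definition of a discrete variable.

```r
data(mtcars)

# Number of distinct values in each column
n_unique <- sapply(mtcars, function(x) length(unique(x)))

# Same heuristic as above: fewer than 10 distinct values counts as discrete
discrete_names <- names(n_unique)[n_unique < 10]
continuous_names <- names(n_unique)[n_unique >= 10]

print(discrete_names)
print(continuous_names)
```

Listing the names makes it easier to check the heuristic by eye, since a column such as `carb` is numeric but takes only a handful of values.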
m <- mean(mtcars$mpg)
v <- var(mtcars$mpg)
s <- sd(mtcars$mpg)
print(paste0("The average of Miles Per Gallon from this data set is ", round(m, 2),
" with variance ", round(v, 2),
" and standard deviation ", round(s, 2), "."))
## [1] "The average of Miles Per Gallon from this data set is 20.09 with variance 36.32 and standard deviation 6.03."
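As a check on what `var()` and `sd()` report, the sketch below recomputes both by hand. It assumes the sample formulas with an `n - 1` denominator, which is what base R uses; the names `x`, `n`, `v_manual`, and `s_manual` are illustrative only.

```r
data(mtcars)
x <- mtcars$mpg
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
v_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
s_manual <- sqrt(v_manual)

# Compare with the built-in functions
print(all.equal(v_manual, var(x)))
print(all.equal(s_manual, sd(x)))
```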
avg_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, mean)
sd_mpg_by_gear <- aggregate(mpg ~ gear, data = mtcars, sd)
print("Table 1: Average MPG for Each Cylinder Class")
## [1] "Table 1: Average MPG for Each Cylinder Class"
print(avg_mpg_by_cyl)
## cyl mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
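print("Table 2: Standard Deviation of MPG for Each Gear Class")
## [1] "Table 2: Standard Deviation of MPG for Each Gear Class"
print(sd_mpg_by_gear)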
## gear mpg
## 1 3 3.371618
## 2 4 5.276764
## 3 5 6.658979
crosstab <- table(mtcars$cyl, mtcars$gear)
print("Crosstab of Cylinder and Gear Combinations:")
## [1] "Crosstab of Cylinder and Gear Combinations:"
print(crosstab)
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
most_common_comb <- which(crosstab == max(crosstab), arr.ind = TRUE)
most_common_cyl <- rownames(crosstab)[most_common_comb[1, 1]]
most_common_gear <- colnames(crosstab)[most_common_comb[1, 2]]
most_common_count <- crosstab[most_common_comb[1, 1], most_common_comb[1, 2]]
print(paste("The most common car type in this data set is a car with", most_common_cyl,
"cylinders and", most_common_gear, "gears. There are a total of",
most_common_count, "cars belonging to this specification in the data set."))
## [1] "The most common car type in this data set is a car with 8 cylinders and 3 gears. There are a total of 12 cars belonging to this specification in the data set."
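The approach above reads only the first row returned by `which()`, so it would silently report one combination if two were tied. A possible alternative, sketched here under the assumption that a long-format table is acceptable, converts the crosstab to a data frame and keeps every row that reaches the maximum count. The names `counts` and `top` are illustrative.

```r
data(mtcars)

# Long-format table of counts: one row per cylinder/gear combination
counts <- as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))

# Sort from most to least frequent
counts <- counts[order(counts$Freq, decreasing = TRUE), ]

# Keep every combination that reaches the maximum count (handles ties)
top <- counts[counts$Freq == max(counts$Freq), ]
print(top)
```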
Use different visualization tools to summarize the data sets in this question.
data("PlantGrowth")
head(PlantGrowth)
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
boxplot(weight ~ group,
data = PlantGrowth,
main = "Comparison of Plant Weights Across Groups",
xlab = "Group",
ylab = "Weight",
col = c("lightblue", "lightgreen", "pink"))
grid()
Result:
=> Report a paragraph to summarize your findings from the plot!
From the boxplot, I noticed differences in plant weights among the three groups: ctrl, trt1, and trt2. The control group (ctrl) had a fairly consistent range of weights, with a higher median than trt1. The trt1 group had the lowest median, which suggests it was the least effective, and it also showed one unusual data point well above the rest of its group. On the other hand, trt2 had the highest median and the narrowest box, showing that this treatment likely helped the plants grow the most. Overall, it looks like trt2 worked best for increasing plant weight.
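The visual reading of the boxplot can be backed up with numbers. The sketch below computes the median of each group and asks `boxplot.stats()` which points fall outside the whiskers, using its default rule of 1.5 times the hinge spread (the same rule `boxplot()` applies). The names `group_medians` and `group_outliers` are illustrative.

```r
data("PlantGrowth")

# Median weight per group (the thick line in each box)
group_medians <- tapply(PlantGrowth$weight, PlantGrowth$group, median)
print(group_medians)

# Points that boxplot() would draw as outliers, per group
group_outliers <- lapply(
  split(PlantGrowth$weight, PlantGrowth$group),
  function(w) boxplot.stats(w)$out
)
print(group_outliers)
```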
data(mtcars)
# Plot
hist(mtcars$mpg,
breaks = 10,
main = "Histogram of Miles Per Gallon (mpg)",
xlab = "Miles Per Gallon (mpg)",
ylab = "Frequency",
col = "lightblue",
border = "black")
grid()
# most frequent mpg class
most_frequent_mpg <- cut(mtcars$mpg, breaks = 10)
table_mpg <- table(most_frequent_mpg)
most_common_class <- names(which.max(table_mpg))
print(paste("Most of the cars in this data set are in the class of", most_common_class, "miles per gallon."))
## [1] "Most of the cars in this data set are in the class of (15.1,17.5] miles per gallon."
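One caveat: `hist()` treats `breaks = 10` as a suggestion and picks rounded break points, while `cut()` splits the range into 10 equal-width intervals, so the class reported above need not match the tallest bar in the plot. A minimal sketch that reads the modal bin from the histogram object itself is shown below; the names `h`, `i`, and `modal_bin` are illustrative.

```r
data(mtcars)

# Compute the histogram without drawing it
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

# Index of the tallest bar
i <- which.max(h$counts)

# Lower and upper edge of that bar, plus how many cars fall in it
modal_bin <- c(lower = h$breaks[i], upper = h$breaks[i + 1], count = h$counts[i])
print(modal_bin)
```

Reading the bin from `h$breaks` keeps the printed sentence consistent with what the plot actually shows.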
data("USArrests")
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
pairs(USArrests,
main = "Pairs Plot of USArrests Variables",
col = "lightblue",
pch = 19)
plot(USArrests$Murder, USArrests$Assault,
main = "Scatter Plot of Murder vs Assault",
xlab = "Murder Rate",
ylab = "Assault Rate",
pch = 19,
col = "darkblue")
grid()
Result:
=> Report a paragraph to summarize your findings from the plot!
The pairs plot highlights the relationships among the variables in the USArrests dataset. Murder, Assault, and Rape are all positively correlated, most strongly Murder and Assault (a correlation of about 0.80), indicating that states with a higher rate of one violent crime often have higher rates of the others. The scatter plot of Murder versus Assault further emphasizes this trend: the points show a clear upward pattern, so states with a high murder rate also tend to have a high assault rate. This supports the idea that these violent crimes are interconnected, perhaps influenced by common underlying factors such as socioeconomic conditions or population density.
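The strength of the relationships seen in the plots can be quantified with a correlation matrix. This is a minimal sketch using Pearson correlation, the default of `cor()`; the name `cor_mat` is illustrative.

```r
data("USArrests")

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(USArrests), 2)
print(cor_mat)

# Correlation between the two variables in the scatter plot
print(cor_mat["Murder", "Assault"])
```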
Download the housing data set from www.jaredlander.com and find out what explains the housing prices in New York City.
Note: Check your working directory to make sure that you can download the data into the data folder.
head(housingData)
## Neighborhood Market.Value.per.SqFt Boro Year.Built
## 1 FINANCIAL 200.00 Manhattan 1920
## 2 FINANCIAL 242.76 Manhattan 1985
## 4 FINANCIAL 271.23 Manhattan 1930
## 5 TRIBECA 247.48 Manhattan 1985
## 6 TRIBECA 191.37 Manhattan 1986
## 7 TRIBECA 211.53 Manhattan 1985
summary(housingData)
## Neighborhood Market.Value.per.SqFt Boro Year.Built
## Length:2530 Min. : 10.66 Length:2530 Min. :1825
## Class :character 1st Qu.: 75.10 Class :character 1st Qu.:1926
## Mode :character Median :114.89 Mode :character Median :1986
## Mean :133.17 Mean :1967
## 3rd Qu.:189.91 3rd Qu.:2005
## Max. :399.38 Max. :2010
mean_market_value <- mean(housingData$Market.Value.per.SqFt, na.rm = TRUE)
median_market_value <- median(housingData$Market.Value.per.SqFt, na.rm = TRUE)
cat("Mean Market Value per SqFt:", mean_market_value, "\n")
## Mean Market Value per SqFt: 133.1731
cat("Median Market Value per SqFt:", median_market_value, "\n")
## Median Market Value per SqFt: 114.89
avg_market_by_neighborhood <- aggregate(Market.Value.per.SqFt ~ Neighborhood,
data = housingData,
FUN = mean)
colnames(avg_market_by_neighborhood) <- c("Neighborhood", "Average Market Value per SqFt")
print(head(avg_market_by_neighborhood, 10)) # Display the first 10 neighborhoods (alphabetical order)
## Neighborhood Average Market Value per SqFt
## 1 ALPHABET CITY 148.35500
## 2 ARROCHAR-SHORE ACRES 57.75000
## 3 ASTORIA 91.48167
## 4 BATH BEACH 70.34000
## 5 BAY RIDGE 68.03500
## 6 BAYSIDE 71.42111
## 7 BEDFORD PARK/NORWOOD 38.24500
## 8 BEDFORD STUYVESANT 83.24172
## 9 BELMONT 56.45000
## 10 BENSONHURST 71.70429
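The table above is in alphabetical order, so it shows the first 10 neighborhoods rather than the 10 most valuable ones. A sketch of how the averages could be ranked is given below. Because the housing file is not bundled with R, it uses a small made-up data frame, `toy_housing`, with the same column names; the neighborhood labels and values in it are hypothetical.

```r
# Hypothetical stand-in for housingData (values are made up for illustration)
toy_housing <- data.frame(
  Neighborhood = c("Area A", "Area A", "Area B", "Area C", "Area C", "Area C"),
  Market.Value.per.SqFt = c(100, 120, 300, 50, 60, 70)
)

# Average value per neighborhood
avg_value <- aggregate(Market.Value.per.SqFt ~ Neighborhood,
                       data = toy_housing, FUN = mean)

# Sort from highest to lowest average, then keep the top n
ranked <- avg_value[order(avg_value$Market.Value.per.SqFt, decreasing = TRUE), ]
top_n <- head(ranked, 2)
print(top_n)
```

Replacing `toy_housing` with `housingData` and `2` with `10` would give the 10 highest-valued neighborhoods in the real data.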
avg_market_by_boro <- aggregate(Market.Value.per.SqFt ~ Boro,
data = housingData,
FUN = mean)
colnames(avg_market_by_boro) <- c("Borough", "Average Market Value per SqFt")
print(avg_market_by_boro)
## Borough Average Market Value per SqFt
## 1 Bronx 47.93232
## 2 Brooklyn 80.13439
## 3 Manhattan 180.59265
## 4 Queens 77.38137
## 5 Staten Island 41.26958
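Means can be pulled upward by a few expensive buildings, so it may help to report the median and the number of buildings next to each borough average. The sketch below shows one way to do that on a small made-up data frame, `toy_housing`, whose values are hypothetical and only mimic the column names of `housingData`.

```r
# Hypothetical stand-in for housingData (values are made up for illustration)
toy_housing <- data.frame(
  Boro = c("Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx",
           "Queens", "Queens", "Queens"),
  Market.Value.per.SqFt = c(200, 250, 180, 40, 50, 70, 80, 90)
)

# Count, mean, and median per borough (tapply returns groups in the same order)
boro_n <- tapply(toy_housing$Market.Value.per.SqFt, toy_housing$Boro, length)
boro_mean <- tapply(toy_housing$Market.Value.per.SqFt, toy_housing$Boro, mean)
boro_median <- tapply(toy_housing$Market.Value.per.SqFt, toy_housing$Boro, median)

boro_summary <- data.frame(
  Borough = names(boro_mean),
  n = as.vector(boro_n),
  mean = as.vector(boro_mean),
  median = as.vector(boro_median)
)

# Order boroughs from highest to lowest median
boro_summary <- boro_summary[order(boro_summary$median, decreasing = TRUE), ]
print(boro_summary)
```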
median_year_by_neighborhood <- aggregate(Year.Built ~ Neighborhood,
data = housingData,
FUN = median)
colnames(median_year_by_neighborhood) <- c("Neighborhood", "Median Year Built")
print(head(median_year_by_neighborhood, 10))
## Neighborhood Median Year Built
## 1 ALPHABET CITY 1999.0
## 2 ARROCHAR-SHORE ACRES 1987.0
## 3 ASTORIA 2006.0
## 4 BATH BEACH 2003.5
## 5 BAY RIDGE 1995.0
## 6 BAYSIDE 1983.0
## 7 BEDFORD PARK/NORWOOD 1980.5
## 8 BEDFORD STUYVESANT 2004.0
## 9 BELMONT 2007.0
## 10 BENSONHURST 2002.0
library(ggplot2)
ggplot(housingData, aes(x = Year.Built, y = Market.Value.per.SqFt)) +
geom_point(color = "blue", alpha = 0.6) +
labs(title = "Market Value per SqFt vs Year Built",
x = "Year Built",
y = "Market Value per SqFt") +
theme_minimal()
ggplot(housingData, aes(x = Boro, y = Market.Value.per.SqFt, fill = Boro)) +
geom_boxplot() +
labs(title = "Market Value per SqFt by Borough",
x = "Borough",
y = "Market Value per SqFt") +
theme_minimal()
ggplot(housingData, aes(x = Market.Value.per.SqFt)) +
geom_histogram(binwidth = 10, fill = "lightblue", color = "black") +
labs(title = "Distribution of Market Value per SqFt",
x = "Market Value per SqFt",
y = "Frequency") +
theme_minimal()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
numeric_vars <- housingData[, sapply(housingData, is.numeric)]
ggpairs(numeric_vars,
title = "Pairwise Relationships of Numeric Variables",
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("smooth")))
Manhattan has the highest average market value per square foot, at $180.59, compared to other boroughs like the Bronx ($47.93) and Staten Island ($41.27). That wasn’t a surprise, but the size of the gaps really struck me. In Manhattan, a neighborhood like Alphabet City averages $148.36 per square foot, while in the Bronx, some neighborhoods average only $38.24. The scatter plot of market value versus year built showed an interesting trend: buildings that were built more recently tend to be valued higher, but not always. I wasn’t expecting buildings from the early 1900s to be so competitively priced.
The boxplot comparing boroughs opened my eyes. In Manhattan, market values range from about $100 to nearly $400 per square foot. On Staten Island, however, the range is much smaller. The comparison with the median was also clear: the overall median across all boroughs was $114.89 per square foot, and Manhattan's values sat well above it. These numbers and plots make it very clear that location affects prices. This activity made me think about how growth in cities affects real estate, and it made me want to look into similar patterns in other places!