DS311 - R Lab Assignment

R Assignment 1

In this assignment, we are going to use some of the built-in data sets in R for descriptive statistics analysis.
To earn full credit for this assignment, students need to complete the coding tasks for each question and produce the results.
After finishing all the questions, knit the document into HTML format for submission.

Question 1

Using the mtcars data set in R, please answer the following questions.

# Loading the data
data(mtcars)

# Head of the data set
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Report the number of variables and observations in the data set.

n_vars <- ncol(mtcars)
n_obs <- nrow(mtcars)

print(paste("There are a total of", n_vars, "variables and", n_obs, "observations in this data set."))

## [1] "There are a total of 11 variables and 32 observations in this data set."

Print the summary statistics of the data set and report how many discrete and continuous variables are in the data set.

# Treat a variable as discrete if it is non-numeric or takes fewer than 10 distinct values
discrete_vars <- sapply(mtcars, function(x) is.factor(x) || is.character(x) || length(unique(x)) < 10)

# Treat a numeric variable with 10 or more distinct values as continuous
continuous_vars <- sapply(mtcars, function(x) is.numeric(x) && length(unique(x)) >= 10)

num_discrete <- sum(discrete_vars)
num_continuous <- sum(continuous_vars)

print(paste("There are", num_discrete, "discrete variables and", num_continuous, "continuous variables in the dataset."))

## [1] "There are 5 discrete variables and 6 continuous variables in the dataset."

Calculate the mean, variance, and standard deviation for the variable mpg and assign them into variable names m, v, and s. Report the results in the print statement.

m <- mean(mtcars$mpg)     
v <- var(mtcars$mpg)       
s <- sd(mtcars$mpg)

print(paste("The average of Miles Per Gallon from this data set is", round(m, 2), 
            "with variance", round(v, 2), 
            "and standard deviation", round(s, 2), "."))

## [1] "The average of Miles Per Gallon from this data set is 20.09 with variance 36.32 and standard deviation 6.03 ."
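The three statistics are linked: `var()` uses the sample formula with an n - 1 denominator, and `sd()` is its square root. A minimal sketch that checks this by hand; the names `x`, `n`, `v_manual`, and `s_manual` are introduced here for illustration only.

```r
data(mtcars)
x <- mtcars$mpg
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
v_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the sample variance
s_manual <- sqrt(v_manual)

# Compare the hand-computed values with the built-in functions
isTRUE(all.equal(v_manual, var(x)))
isTRUE(all.equal(s_manual, sd(x)))
```

Both comparisons should agree with the built-in results, which is a quick way to confirm that the reported variance is the sample variance rather than the population variance.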

Create two tables to summarize 1) average mpg for each cylinder class and 2) the standard deviation of mpg for each gear class.

avg_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, mean)

sd_mpg_by_gear <- aggregate(mpg ~ gear, data = mtcars, sd)

print("Table 1: Average MPG for Each Cylinder Class")

## [1] "Table 1: Average MPG for Each Cylinder Class"

print(avg_mpg_by_cyl)

##   cyl      mpg
## 1   4 26.66364
## 2   6 19.74286
## 3   8 15.10000

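print("Table 2: Standard Deviation of MPG for Each Gear Class")

## [1] "Table 2: Standard Deviation of MPG for Each Gear Class"

print(sd_mpg_by_gear)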

##   gear      mpg
## 1    3 3.371618
## 2    4 5.276764
## 3    5 6.658979
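As an alternative to separate `aggregate()` calls, several statistics can be collected per group in one pass. The sketch below is one possible approach, not the required one; `group_stats` is a name chosen here for illustration.

```r
data(mtcars)

# Split mpg by cylinder class, then compute count, mean, and sd for each piece.
# sapply() returns a matrix with one column per cylinder class.
group_stats <- sapply(
  split(mtcars$mpg, mtcars$cyl),
  function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)

# Transpose so that each row is a cylinder class
t(round(group_stats, 2))
```

Including the group size next to the mean helps when judging how much weight a group average deserves.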

Create a crosstab that shows the number of observations belonging to each combination of cylinder and gear class. The table should show how many cars have 4 cylinders with 3 gears, 4 cylinders with 4 gears, etc. Report which combination is recorded most often in this data set and how many observations there are for this type of car.

crosstab <- table(mtcars$cyl, mtcars$gear)
print("Crosstab of Cylinder and Gear Combinations:")

## [1] "Crosstab of Cylinder and Gear Combinations:"

print(crosstab)

##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2

most_common_comb <- which(crosstab == max(crosstab), arr.ind = TRUE)
most_common_cyl <- rownames(crosstab)[most_common_comb[1, 1]]
most_common_gear <- colnames(crosstab)[most_common_comb[1, 2]]
most_common_count <- crosstab[most_common_comb[1, 1], most_common_comb[1, 2]]

print(paste("The most common car type in this data set is car with", most_common_cyl, 
            "cylinders and", most_common_gear, "gears. There are a total of", 
            most_common_count, "cars belonging to this specification in the data set."))

## [1] "The most common car type in this data set is car with 8 cylinders and 3 gears. There are a total of 12 cars belonging to this specification in the data set."
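Indexing the first row of `which(..., arr.ind = TRUE)` reports only one combination even if several are tied for the maximum. A hedged sketch of an alternative that converts the crosstab to long format and keeps every combination with the maximum count; the names `counts` and `top` are illustrative.

```r
data(mtcars)

# Named dimensions make the long-format column names readable
crosstab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# One row per cylinder/gear combination, with the count in column n
counts <- as.data.frame(crosstab, responseName = "n")

# Sort from most to least common
counts <- counts[order(-counts$n), ]

# Keep all combinations that share the maximum count (handles ties)
top <- counts[counts$n == max(counts$n), ]
top
```

For `mtcars` there is a single maximum, so `top` has one row, but the same code would list every tied combination in another data set.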

Question 2

Use different visualization tools to summarize the data sets in this question.

Using the PlantGrowth data set, visualize and compare the weight of the plants in the three separate groups. Give labels to the title, x-axis, and y-axis on the graph. Write a paragraph to summarize your findings.

data("PlantGrowth")

head(PlantGrowth)

##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

boxplot(weight ~ group, 
        data = PlantGrowth, 
        main = "Comparison of Plant Weights Across Groups",
        xlab = "Group",
        ylab = "Weight",
        col = c("lightblue", "lightgreen", "pink"))

grid()

Result:

=> Report a paragraph to summarize your findings from the plot!

From the boxplot, I noticed differences in plant weights among the three groups: ctrl, trt1, and trt2. The control group (ctrl) had a fairly consistent range of weights, with a higher median than trt1. The trt1 group had the lowest median, below even the control, which suggests it was the least effective treatment; it also had one unusually high point plotted beyond the upper whisker. On the other hand, trt2 had the highest median and a compact box, showing that this treatment likely helped the plants grow the most. Overall, it looks like trt2 worked best for increasing plant weight.
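Readings taken from a boxplot can be checked numerically. The sketch below computes the group medians and uses `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule as `boxplot()` by default, to list any points that would be drawn as outliers; `group_medians` and `outliers` are names chosen here for illustration.

```r
data("PlantGrowth")

# Median weight per group (the thick line inside each box)
group_medians <- tapply(PlantGrowth$weight, PlantGrowth$group, median)

# Points beyond the whiskers in each group, using the default boxplot rule
outliers <- lapply(
  split(PlantGrowth$weight, PlantGrowth$group),
  function(w) boxplot.stats(w)$out
)

group_medians
outliers
```

Printing `outliers` shows which group each flagged point belongs to, which is easy to misread from the plot alone.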

Using the mtcars data set, plot the histogram for the column mpg with 10 breaks. Give labels to the title, x-axis, and y-axis on the graph. Report the most observed mpg class from the data set.

data(mtcars)

# Plot 
hist(mtcars$mpg, 
     breaks = 10, 
     main = "Histogram of Miles Per Gallon (mpg)",
     xlab = "Miles Per Gallon (mpg)", 
     ylab = "Frequency",
     col = "lightblue",
     border = "black")

grid()

#  most frequent mpg class
most_frequent_mpg <- cut(mtcars$mpg, breaks = 10)
table_mpg <- table(most_frequent_mpg)
most_common_class <- names(which.max(table_mpg))

print(paste("Most of the cars in this data set are in the class of", most_common_class, "miles per gallon."))

## [1] "Most of the cars in this data set are in the class of (15.1,17.5] miles per gallon."
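Note that `hist()` treats `breaks = 10` as a suggestion and chooses "pretty" break points, while `cut(breaks = 10)` splits the range into 10 equal-width intervals, so the two sets of intervals need not line up. One way to report the class that matches the plotted bars is to read the counts from the histogram object itself. The sketch below fixes the breaks explicitly at width 2 from 10 to 34; that choice is an assumption made here so the bins are unambiguous.

```r
data(mtcars)

# Assumed breaks: width-2 bins from 10 to 34 (stated explicitly, not taken from the plot)
h <- hist(mtcars$mpg, breaks = seq(10, 34, by = 2), plot = FALSE)

# Index of the tallest bar and the interval it covers
tallest <- which.max(h$counts)
bin_label <- paste0("(", h$breaks[tallest], ", ", h$breaks[tallest + 1], "]")

print(paste("The tallest bar covers", bin_label, "mpg with",
            h$counts[tallest], "cars."))
```

Reading `h$breaks` and `h$counts` keeps the reported class consistent with whatever bins the histogram actually uses.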

Using the USArrests data set, create a pairs plot to display the correlations between the variables in the data set. Plot the scatter plot with Murder and Assault. Give labels to the title, x-axis, and y-axis on the graph. Write a paragraph to summarize your results from both plots.

data("USArrests")

head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

pairs(USArrests, 
      main = "Pairs Plot of USArrests Variables", 
      col = "lightblue", 
      pch = 19)

plot(USArrests$Murder, USArrests$Assault, 
     main = "Scatter Plot of Murder vs Assault", 
     xlab = "Murder Rate", 
     ylab = "Assault Rate", 
     pch = 19, 
     col = "darkblue")

grid()

Result:

=> Report a paragraph to summarize your findings from the plot!

The pairs plot highlights the relationships among the variables in the USArrests dataset, showing positive correlations among Murder, Assault, and Rape, with the tightest pattern between Murder and Assault. This indicates that states with higher rates of one violent crime often exhibit higher rates of the others. The scatter plot of Murder versus Assault further emphasizes this trend, as the points show a clear upward pattern, suggesting that states with a high murder rate also tend to have a high assault rate. This reinforces the idea that these violent crimes are likely interconnected, perhaps influenced by common underlying factors such as socioeconomic conditions or population density.
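The visual impression from the two plots can be backed up with numbers. A minimal sketch that prints the Pearson correlation matrix and tests the Murder/Assault correlation; `cor_matrix` and `murder_assault` are names introduced here for illustration.

```r
data("USArrests")

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(USArrests), 2)
cor_matrix

# Correlation test for the pair shown in the scatter plot
murder_assault <- cor.test(USArrests$Murder, USArrests$Assault)
murder_assault$estimate
murder_assault$p.value
```

Comparing the entries of `cor_matrix` shows which pairs are strongly related and which are only weakly related, which the pairs plot suggests but does not quantify.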

Question 3

Download the housing data set from www.jaredlander.com and find out what explains the housing prices in New York City.

Note: Check your working directory to make sure that you can download the data into the data folder.

##   Neighborhood Market.Value.per.SqFt      Boro Year.Built
## 1    FINANCIAL                200.00 Manhattan       1920
## 2    FINANCIAL                242.76 Manhattan       1985
## 4    FINANCIAL                271.23 Manhattan       1930
## 5      TRIBECA                247.48 Manhattan       1985
## 6      TRIBECA                191.37 Manhattan       1986
## 7      TRIBECA                211.53 Manhattan       1985
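The loading step that produces the preview above is not shown in the knitted output. The sketch below is one way it could look. Everything specific in it is an assumption: the helper name `load_housing`, the file location `data/housing.csv`, and the exact download URL. The gap in the row labels (row 3 is missing) suggests that rows with missing values were dropped, so the helper does the same.

```r
# Hypothetical helper: read the housing CSV, keep the four columns shown in the
# preview, and drop rows that contain missing values.
load_housing <- function(path) {
  raw <- read.csv(path, stringsAsFactors = FALSE)
  keep <- c("Neighborhood", "Market.Value.per.SqFt", "Boro", "Year.Built")
  na.omit(raw[, keep])
}

# Usage, commented out because it needs network access.
# The URL and the data folder are assumptions, not confirmed by this document.
# dir.create("data", showWarnings = FALSE)
# download.file("https://www.jaredlander.com/data/housing.csv",
#               destfile = "data/housing.csv", mode = "wb")
# housingData <- load_housing("data/housing.csv")
# head(housingData)
```

Keeping the download separate from the cleaning step means the file only has to be fetched once, and the cleaning can be rerun on the local copy.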

Create your own descriptive statistics and aggregation tables to summarize the data set and find any meaningful results between different variables in the data set.

head(housingData)

##   Neighborhood Market.Value.per.SqFt      Boro Year.Built
## 1    FINANCIAL                200.00 Manhattan       1920
## 2    FINANCIAL                242.76 Manhattan       1985
## 4    FINANCIAL                271.23 Manhattan       1930
## 5      TRIBECA                247.48 Manhattan       1985
## 6      TRIBECA                191.37 Manhattan       1986
## 7      TRIBECA                211.53 Manhattan       1985

summary(housingData)

##  Neighborhood       Market.Value.per.SqFt     Boro             Year.Built  
##  Length:2530        Min.   : 10.66        Length:2530        Min.   :1825  
##  Class :character   1st Qu.: 75.10        Class :character   1st Qu.:1926  
##  Mode  :character   Median :114.89        Mode  :character   Median :1986  
##                     Mean   :133.17                           Mean   :1967  
##                     3rd Qu.:189.91                           3rd Qu.:2005  
##                     Max.   :399.38                           Max.   :2010

mean_market_value <- mean(housingData$Market.Value.per.SqFt, na.rm = TRUE)
median_market_value <- median(housingData$Market.Value.per.SqFt, na.rm = TRUE)

cat("Mean Market Value per SqFt:", mean_market_value, "\n")

## Mean Market Value per SqFt: 133.1731

cat("Median Market Value per SqFt:", median_market_value, "\n")

## Median Market Value per SqFt: 114.89

avg_market_by_neighborhood <- aggregate(Market.Value.per.SqFt ~ Neighborhood,
                                        data = housingData,
                                        FUN = mean)
colnames(avg_market_by_neighborhood) <- c("Neighborhood", "Average Market Value per SqFt")
print(head(avg_market_by_neighborhood, 10))  # Display the first 10 neighborhoods (alphabetical order)

##            Neighborhood Average Market Value per SqFt
## 1         ALPHABET CITY                     148.35500
## 2  ARROCHAR-SHORE ACRES                      57.75000
## 3               ASTORIA                      91.48167
## 4            BATH BEACH                      70.34000
## 5             BAY RIDGE                      68.03500
## 6               BAYSIDE                      71.42111
## 7  BEDFORD PARK/NORWOOD                      38.24500
## 8    BEDFORD STUYVESANT                      83.24172
## 9               BELMONT                      56.45000
## 10          BENSONHURST                      71.70429

avg_market_by_boro <- aggregate(Market.Value.per.SqFt ~ Boro,
                                data = housingData,
                                FUN = mean)
colnames(avg_market_by_boro) <- c("Borough", "Average Market Value per SqFt")
print(avg_market_by_boro)

##         Borough Average Market Value per SqFt
## 1         Bronx                      47.93232
## 2      Brooklyn                      80.13439
## 3     Manhattan                     180.59265
## 4        Queens                      77.38137
## 5 Staten Island                      41.26958

median_year_by_neighborhood <- aggregate(Year.Built ~ Neighborhood,
                                         data = housingData,
                                         FUN = median)
colnames(median_year_by_neighborhood) <- c("Neighborhood", "Median Year Built")
print(head(median_year_by_neighborhood, 10))

##            Neighborhood Median Year Built
## 1         ALPHABET CITY            1999.0
## 2  ARROCHAR-SHORE ACRES            1987.0
## 3               ASTORIA            2006.0
## 4            BATH BEACH            2003.5
## 5             BAY RIDGE            1995.0
## 6               BAYSIDE            1983.0
## 7  BEDFORD PARK/NORWOOD            1980.5
## 8    BEDFORD STUYVESANT            2004.0
## 9               BELMONT            2007.0
## 10          BENSONHURST            2002.0

Create multiple plots to demonstrate the correlations between different variables. Remember to label all axes and give a title to each graph.

library(ggplot2)

ggplot(housingData, aes(x = Year.Built, y = Market.Value.per.SqFt)) +
    geom_point(color = "blue", alpha = 0.6) +
    labs(title = "Market Value per SqFt vs Year Built",
         x = "Year Built",
         y = "Market Value per SqFt") +
    theme_minimal()

ggplot(housingData, aes(x = Boro, y = Market.Value.per.SqFt, fill = Boro)) +
    geom_boxplot() +
    labs(title = "Market Value per SqFt by Borough",
         x = "Borough",
         y = "Market Value per SqFt") +
    theme_minimal()

ggplot(housingData, aes(x = Market.Value.per.SqFt)) +
    geom_histogram(binwidth = 10, fill = "lightblue", color = "black") +
    labs(title = "Distribution of Market Value per SqFt",
         x = "Market Value per SqFt",
         y = "Frequency") +
    theme_minimal()

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

numeric_vars <- housingData[, sapply(housingData, is.numeric)]
ggpairs(numeric_vars,
        title = "Pairwise Relationships of Numeric Variables",
        upper = list(continuous = wrap("cor", size = 3)),
        lower = list(continuous = wrap("smooth")))

Write a summary about your findings from this exercise.

Manhattan has the highest market value per square foot, at $180.59 on average, compared to other boroughs like the Bronx ($47.93) and Staten Island ($41.27). It wasn’t a surprise, but the big gaps in the numbers really hit me. In Manhattan, a neighborhood like Alphabet City averages $148.36 per square foot, while in the Bronx, Bedford Park/Norwood averages only $38.24. The scatter plot of market value versus the year built showed an interesting trend: buildings that were built more recently tend to be priced higher, but not always. I wasn’t expecting that homes from the early 1900s would be so competitively priced.

The boxplot comparing boroughs opened my eyes. In Manhattan, market values range from about $100 to nearly $400 per square foot (the maximum in the data is $399.38). On Staten Island, however, the range is much smaller. The overall median across the whole data set was $114.89 per square foot, and Manhattan’s average of $180.59 sat well above it. These numbers and plots make it very clear that location affects prices. This activity made me think about how growth in cities affects real estate, and it made me want to look into similar patterns in other places!
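One way to put numbers on the idea that location affects prices is a linear model of price per square foot on borough and year built. The sketch below is only an illustration of the call: `fit_price_model` is a hypothetical helper, and the `toy` data frame contains made-up values, not rows from the real housing data. On the real data the call would be `fit_price_model(housingData)`.

```r
# Hypothetical helper: regress price per square foot on borough and year built.
fit_price_model <- function(df) {
  df$Boro <- factor(df$Boro)
  lm(Market.Value.per.SqFt ~ Boro + Year.Built, data = df)
}

# Made-up values, used only to show that the helper runs
toy <- data.frame(
  Boro = rep(c("Bronx", "Manhattan"), each = 4),
  Year.Built = c(1920, 1950, 1980, 2005, 1925, 1960, 1985, 2008),
  Market.Value.per.SqFt = c(40, 45, 50, 58, 160, 175, 185, 200)
)

fit <- fit_price_model(toy)

# Borough coefficients are differences from the baseline borough (the first
# factor level); the Year.Built coefficient is the change per year.
coef(fit)
summary(fit)$r.squared
```

The borough coefficients and the R-squared from such a model give a rough sense of how much of the price variation borough and building age account for, under the usual linear model assumptions.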

Your Name

2024-11-26
