Part (a): Basic R Operations

In this section, we demonstrate the creation and summary of basic data structures in R, including vectors, matrices, and data frames.

1. Vector and its Summary

A vector is a basic data structure in R that holds a sequence of elements of the same type.

# Creating a simple vector of 5 numbers
my_vector <- c(10, 25, 30, 45, 50)
print(my_vector)
## [1] 10 25 30 45 50
# Summary of the vector
summary(my_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10      25      30      32      45      50

Interpretation: summary() reports six descriptive statistics for the vector: the minimum (10), first quartile (25), median (30), mean (32), third quartile (45), and maximum (50). The median and mean describe the central tendency, while the minimum, maximum, and quartiles describe the spread of the data points.
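To make the link between summary() and its individual statistics concrete, the following minimal sketch recomputes them by hand for the same vector. The names five_num, avg, and spread are illustrative and do not appear in the original report.

```r
# Recompute the pieces of summary() separately for the same vector
my_vector <- c(10, 25, 30, 45, 50)

# Minimum, quartiles, and maximum (quantile() uses its default method, type 7)
five_num <- quantile(my_vector, probs = c(0, 0.25, 0.5, 0.75, 1))

avg    <- mean(my_vector)          # 160 / 5
spread <- diff(range(my_vector))   # maximum minus minimum

print(five_num)
print(avg)
print(spread)
```

With only five values, each quartile falls exactly on one of the data points, which is why summary() shows no fractional values here.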

2. Matrix and its Summary

A matrix is a collection of elements of the same type arranged in a two-dimensional grid of rows and columns. By default, matrix() fills the grid column by column.

# Creating a matrix with 2 rows and 5 columns
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)
print(my_matrix)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
# Summary of the matrix
summary(my_matrix)
##        V1             V2             V3             V4             V5       
##  Min.   :1.00   Min.   :3.00   Min.   :5.00   Min.   :7.00   Min.   : 9.00  
##  1st Qu.:1.25   1st Qu.:3.25   1st Qu.:5.25   1st Qu.:7.25   1st Qu.: 9.25  
##  Median :1.50   Median :3.50   Median :5.50   Median :7.50   Median : 9.50  
##  Mean   :1.50   Mean   :3.50   Mean   :5.50   Mean   :7.50   Mean   : 9.50  
##  3rd Qu.:1.75   3rd Qu.:3.75   3rd Qu.:5.75   3rd Qu.:7.75   3rd Qu.: 9.75  
##  Max.   :2.00   Max.   :4.00   Max.   :6.00   Max.   :8.00   Max.   :10.00

Interpretation: summary() treats each column of the matrix as a separate variable, so it reports one set of statistics per column (labelled V1 to V5). Because each column holds only two values, the quartiles, median, and mean all lie between that column's two entries.
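The column-wise behaviour can be reproduced explicitly with apply(). The sketch below is only an illustration: col_stats, row_means, and by_row are names introduced here, and the byrow = TRUE variant is shown for contrast rather than taken from the report.

```r
# Same matrix as above: filled column by column
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)

# Apply summary() to each column (MARGIN = 2); the result has one column per matrix column
col_stats <- apply(my_matrix, 2, summary)
print(col_stats)

# Row-wise averages, for comparison with the column-wise view
row_means <- rowMeans(my_matrix)
print(row_means)

# Filling by row instead changes which values end up in each column
by_row <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
print(by_row)
```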

3. Data Frame and its Summary

A data frame is a table where each column can contain different types of data. Per the project requirements, this data frame consists of 10 rows and 5 columns.

# Creating a data frame with 10 rows and 5 columns
student_data <- data.frame(
  ID = 1:10,
  Marks = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
  Attendance = c(90, 85, 95, 88, 80, 98, 92, 87, 94, 82),
  Age = c(20, 21, 19, 22, 20, 21, 19, 20, 22, 21),
  Gender = factor(rep(c("Male", "Female"), 5))
)

# Showing the data frame
print(student_data)
##    ID Marks Attendance Age Gender
## 1   1    85         90  20   Male
## 2   2    78         85  21 Female
## 3   3    92         95  19   Male
## 4   4    88         88  22 Female
## 5   5    76         80  20   Male
## 6   6    95         98  21 Female
## 7   7    89         92  19   Male
## 8   8    84         87  20 Female
## 9   9    91         94  22   Male
## 10 10    80         82  21 Female
# Summary of the data frame
summary(student_data)
##        ID            Marks        Attendance        Age          Gender 
##  Min.   : 1.00   Min.   :76.0   Min.   :80.0   Min.   :19.0   Female:5  
##  1st Qu.: 3.25   1st Qu.:81.0   1st Qu.:85.5   1st Qu.:20.0   Male  :5  
##  Median : 5.50   Median :86.5   Median :89.0   Median :20.5             
##  Mean   : 5.50   Mean   :85.8   Mean   :89.1   Mean   :20.5             
##  3rd Qu.: 7.75   3rd Qu.:90.5   3rd Qu.:93.5   3rd Qu.:21.0             
##  Max.   :10.00   Max.   :95.0   Max.   :98.0   Max.   :22.0

Interpretation of Summary: The summary() function calculates descriptive statistics for every column. For numeric columns like ‘Marks’, we see a mean of ~85.8. For categorical variables like ‘Gender’, it provides a count of 5 Males and 5 Females.
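The different treatment of numeric and categorical columns follows from each column's class. As a hedged illustration, the sketch below rebuilds two of the columns and inspects them directly; col_classes, gender_counts, and marks_by_gender are names introduced here, and the group means are an extra step not performed in the report.

```r
# Two columns from the data frame above, rebuilt so the sketch runs on its own
student_data <- data.frame(
  Marks  = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
  Gender = factor(rep(c("Male", "Female"), 5))
)

# The class of each column decides how summary() reports it
col_classes <- sapply(student_data, class)
print(col_classes)

# For a factor, summary() gives the same counts as table()
gender_counts <- table(student_data$Gender)
print(gender_counts)

# Optional extra: mean marks within each gender group
marks_by_gender <- tapply(student_data$Marks, student_data$Gender, mean)
print(marks_by_gender)
```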

Part (b): Data Importing and Visualization

1. Process of Importing Data on Windows

To work with external data in Windows, we follow these steps:
1. Package Installation: We install a package that can read the file type. For Excel files (.xls or .xlsx), we run install.packages("readxl") once.
2. Loading the Package: We run library(readxl) in each session to make its functions available.
3. Path Specification: Windows paths use backslashes, which R treats as escape characters inside strings, so we write paths with forward slashes (/) or doubled backslashes (\\).
4. Handling No-Header Files: Since our .xls files are “straight data” with no header row, we import them with col_names = FALSE, then refer to the .txt description files to see which column represents which variable, and assign names using colnames().
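The no-header workflow in the steps above can be sketched without any Excel file. The example below is a stand-in only: it writes a tiny made-up CSV to a temporary file and reads it with base R's read.csv(header = FALSE), which plays the same role here as read_excel(col_names = FALSE). The values, the two column names, and the paths in the comments are all hypothetical.

```r
# Create a small headerless file (made-up values) so the sketch runs anywhere
tmp <- tempfile(fileext = ".csv")
write.table(data.frame(c(4.5, 6.0), c(10, 12)), tmp,
            sep = ",", row.names = FALSE, col.names = FALSE)

# On Windows, a path would be written with forward slashes or doubled backslashes,
# e.g. "C:/data/myfile.csv" or "C:\\data\\myfile.csv" (hypothetical locations)

# Import without treating the first row as a header; columns get default names V1, V2
raw <- read.csv(tmp, header = FALSE)
print(colnames(raw))

# Assign meaningful names, as one would after consulting a description file
colnames(raw) <- c("wage", "educ")
print(raw)

unlink(tmp)  # remove the temporary file
```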

2. Visualizations using Imported Data (wage1.xls)

We are importing the wage1.xls data. Since it lacks headers, we use the descriptions provided in WAGE1_description.txt.

library(readxl)
## Warning: package 'readxl' was built under R version 4.5.3
# Importing wage1.xls (No headers in file)
wage1_data <- read_excel("wage1.xls", col_names = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
# Assigning column names based on WAGE1_description.txt
colnames(wage1_data) <- c("wage", "educ", "exper", "tenure", "nonwhite", "female", 
                         "married", "numdep", "smsa", "northcen", "south", "west", 
                         "construc", "ndurman", "trcommpu", "trade", "services", 
                         "profserv", "profocc", "clerocc", "servocc", "lwage", 
                         "expersq", "tenursq")

# Summary for interpretation
summary(wage1_data$wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.530   3.330   4.650   5.896   6.880  24.980
# 1. Histogram (Distribution of Hourly Wages)
hist(wage1_data$wage, col="lightgreen", main="Histogram of Hourly Wage", xlab="Wage ($ per hour)")

# 2. Box Plot (Education Level by Gender)
boxplot(educ ~ female, data = wage1_data, col="plum", 
        main="Box Plot: Education Level by Gender", 
        xlab="Gender (0=Male, 1=Female)", ylab="Years of Education")

# 3. Scatter Plot (Education vs Wage)
plot(wage1_data$educ, wage1_data$wage, main="Scatter Plot: Education vs Wage", 
     xlab="Years of Education", ylab="Hourly Wage ($)", pch=19, col="darkblue")

# 4. Frequency Density (Density of Hourly Wage)
plot(density(wage1_data$wage), main="Frequency Density of Hourly Wage", col="darkred", lwd=2)

Interpretation of Summary and Graphs:
1. Summary Interpretation: The imported wage data ranges from $0.53 to $24.98 per hour. The mean wage ($5.90) is higher than the median ($4.65), indicating a positive (right) skew.
2. Histogram: Visually confirms that most workers earn lower wages, with a long tail on the right representing high earners.
3. Box Plot: Illustrates that median education levels are relatively consistent between genders in this specific sample.
4. Scatter Plot: Shows a clear positive correlation; as years of education increase, wages tend to rise.
5. Frequency Density: Shows the highest frequency of wages is concentrated around the $4-$5 per hour mark.
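The mean-versus-median argument for skewness can be checked on a toy example. The values below are made up for illustration and are not taken from wage1.xls; toy_wages and the other names are introduced here.

```r
# Made-up hourly wages: most are low, one is very high
toy_wages <- c(3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 25.0)

toy_mean   <- mean(toy_wages)     # pulled upward by the single large value
toy_median <- median(toy_wages)   # middle value, unaffected by how large the maximum is

# A mean above the median is the usual sign of a right-skewed distribution
right_skewed <- toy_mean > toy_median

print(c(mean = toy_mean, median = toy_median))
print(right_skewed)
```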

Part (c): Correlation and Simple Linear Regression

We are now importing the hprice1.xls dataset. We use HPRICE1_description.txt to name the columns.

# Importing hprice1.xls (No headers in file)
hprice_data <- read_excel("hprice1.xls", col_names = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
# Assigning column names based on HPRICE1_description.txt
colnames(hprice_data) <- c("price", "assess", "bdrms", "lotsize", "sqrft", 
                          "colonial", "lprice", "lassess", "llotsize", "lsqrft")

# 1. Correlation between Size (sqrft) and Price
correlation_value <- cor(hprice_data$sqrft, hprice_data$price)
print(paste("Correlation Coefficient:", round(correlation_value, 4)))
## [1] "Correlation Coefficient: 0.7879"

Interpretation: The correlation coefficient of 0.7879 indicates a strong, positive linear relationship between the size of the house (sqrft) and its price. As square footage increases, price tends to increase. The coefficient measures the strength of this linear association, not the size of the price change per square foot.
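For readers who want to see what cor() computes, the sketch below evaluates the Pearson formula by hand and compares it with the built-in function. The vectors x and y are made-up numbers, not values from hprice1.xls.

```r
# Made-up sizes and prices, for illustration only
x <- c(1200, 1500, 1800, 2100, 2600)
y <- c(210, 240, 300, 310, 420)

# Pearson correlation: co-variation of x and y divided by the product of their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)   # default method is "pearson"

print(c(manual = r_manual, builtin = r_builtin))
```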

# 2. Simple Linear Regression Model
# price (Dependent) ~ sqrft (Independent)
simple_model <- lm(price ~ sqrft, data = hprice_data)
summary(simple_model)
## 
## Call:
## lm(formula = price ~ sqrft, data = hprice_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -117.112  -36.348   -6.503   31.701  235.253 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.20414   24.74261   0.453    0.652    
## sqrft        0.14021    0.01182  11.866   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.62 on 86 degrees of freedom
## Multiple R-squared:  0.6208, Adjusted R-squared:  0.6164 
## F-statistic: 140.8 on 1 and 86 DF,  p-value: < 2.2e-16

Interpretation of Results:
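1. Intercept & Slope: Price is read here in thousands of dollars. The intercept is ~11.20, which taken literally implies that a house with 0 square feet would cost about $11,200; this is an extrapolation with no practical meaning, and the intercept is not statistically different from zero (p = 0.652). The slope for ‘sqrft’ is ~0.14, meaning each additional square foot is associated with a price increase of ~$140.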
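2. Variance: The standard errors of the estimates are 24.74 for the intercept and 0.0118 for ‘sqrft’; the parameter variances are the squares of these values (about 612.2 and 0.00014). The Residual Standard Error (63.62) estimates the standard deviation of the error term, so the estimated error variance is 63.62 squared, about 4047.5.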
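3. T-test & P-value: The t-statistic for ‘sqrft’ is 11.866 with a p-value below 2e-16, so we reject the hypothesis that the slope is zero; ‘sqrft’ is a highly significant predictor.
4. Adj R-square: The Multiple R-squared is 0.6208, so roughly 62% of the variation in price is explained by size. The Adjusted R-squared (0.6164) applies a small penalty for the number of predictors.
5. F-test: The F-statistic (140.8 on 1 and 86 degrees of freedom, p < 2.2e-16) is significant, validating the overall model. With a single predictor, it equals the square of the slope's t-statistic (11.866 squared is about 140.8).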
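The quantities discussed in this interpretation can also be pulled out of a fitted model programmatically. Because hprice1.xls is not bundled with R, the sketch below uses the built-in cars dataset purely as a stand-in; the object names are illustrative.

```r
# Stand-in model on a built-in dataset (not the housing data)
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

coef_table <- coef(s)                      # Estimate, Std. Error, t value, Pr(>|t|)
std_errors <- coef_table[, "Std. Error"]
param_var  <- diag(vcov(fit))              # parameter variances = squared standard errors

resid_se  <- sigma(fit)                    # residual standard error (error SD estimate)
error_var <- resid_se^2                    # estimated error variance

t_slope <- coef_table["speed", "t value"]
f_stat  <- s$fstatistic[["value"]]         # overall F-statistic

print(coef_table)
print(c(F = f_stat, t_squared = t_slope^2))   # equal in a one-predictor model
print(c(R2 = s$r.squared, r_squared = cor(cars$speed, cars$dist)^2))
```

In a simple regression, the last two lines show that the F-statistic is the squared t-statistic of the slope and that R-squared is the squared correlation, which matches the relationship between 0.7879 and 0.6208 reported above.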

# Plotting the regression model on the graph
plot(hprice_data$sqrft, hprice_data$price, main="Simple Regression: Price vs Size", pch=16)
abline(simple_model, col="red", lwd=2)

Part (d): Multiple Regression Model Comparison

We expand the model by adding the Number of Bedrooms (bdrms).

# Multiple Regression Model: Price depends on Size and Bedrooms
multiple_model <- lm(price ~ sqrft + bdrms, data = hprice_data)
summary(multiple_model)
## 
## Call:
## lm(formula = price ~ sqrft + bdrms, data = hprice_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.627  -42.876   -7.051   32.589  229.003 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.31500   31.04662  -0.622    0.536    
## sqrft         0.12844    0.01382   9.291 1.39e-14 ***
## bdrms        15.19819    9.48352   1.603    0.113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.04 on 85 degrees of freedom
## Multiple R-squared:  0.6319, Adjusted R-squared:  0.6233 
## F-statistic: 72.96 on 2 and 85 DF,  p-value: < 2.2e-16

Comparison and Interpretation

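  1. Intercept & Slope: In the multiple model, the intercept is -19.31 and is not statistically significant (p = 0.536). The ‘sqrft’ slope is ~0.13, meaning an additional square foot adds ~$128 to the price, holding the number of bedrooms constant. The ‘bdrms’ slope is 15.20, meaning an extra bedroom adds ~$15,200 to the house price, holding size constant.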
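  2. Variance comparison: The standard errors are 31.05 for the intercept, 0.0138 for ‘sqrft’, and 9.48 for ‘bdrms’; the parameter variances are the squares of these values. The standard error of ‘sqrft’ rose from 0.0118 to 0.0138 once ‘bdrms’ was added, which typically happens when the added predictor is correlated with an existing one. The Residual Standard Error dropped slightly from 63.62 to 63.04, meaning the multiple model has a marginally lower estimated error variance.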
  3. T-test: The ‘sqrft’ variable remains highly significant (p < 0.001), but the ‘bdrms’ variable is not statistically significant (p = 0.113) at the 5% level.
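  4. R-Square Comparison: The Multiple R-squared rose from 0.6208 to 0.6319, which is expected because it never decreases when a predictor is added. The Adjusted R-squared, which penalizes extra predictors, rose from 0.6164 to 0.6233, a gain of less than 0.01.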
  5. F-test: The F-statistic (72.96) remains highly significant.

Conclusion: While the Multiple Regression model technically provides a slightly better fit (higher Adjusted R-square), the bedroom variable is not statistically significant. Therefore, the simpler model may be preferred for its parsimony.
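A formal way to compare two nested models is a partial F-test with anova(). The report does not run this test; the sketch below shows the idea on the built-in mtcars dataset as a stand-in, with object names chosen for illustration.

```r
# Nested stand-in models on a built-in dataset (not the housing data)
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + qsec, data = mtcars)

# Partial F-test: does the extra predictor reduce the residual sum of squares enough?
cmp <- anova(small, large)
print(cmp)

f_added <- cmp$F[2]                                   # F for the added variable
t_added <- coef(summary(large))["qsec", "t value"]    # its t-statistic in the larger model

# With one added predictor, the partial F equals the squared t-statistic
print(c(F = f_added, t_squared = t_added^2))

# Adjusted R-squared for both models, side by side
adj <- c(small = summary(small)$adj.r.squared,
         large = summary(large)$adj.r.squared)
print(adj)
```

Because only one variable is added, this test gives the same p-value as the t-test on that variable, so for the housing models it would lead to the same conclusion as the p = 0.113 reported for ‘bdrms’.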