In this section, we demonstrate the creation and summary of basic data structures in R, including vectors, matrices, and data frames.
A vector is a basic data structure in R that holds a sequence of elements of the same type.
# Creating a simple vector of 5 numbers
my_vector <- c(10, 25, 30, 45, 50)
print(my_vector)
## [1] 10 25 30 45 50
# Summary of the vector
summary(my_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      10      25      30      32      45      50
Interpretation: The summary reports descriptive statistics for our vector: the minimum (10), first quartile (25), median (30), mean (32), third quartile (45), and maximum (50). The median and mean describe the central tendency, while the minimum, maximum, and quartiles describe the spread of our data points.
A matrix is a collection of data elements of the same type arranged in a two-dimensional grid of rows and columns.
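Each figure reported by summary() can also be reproduced with an individual base R function, which makes it clearer where the numbers come from. The following is a minimal sketch; it re-creates the vector so that it runs on its own, and the names v_min, v_max, v_median, v_mean, and v_quart are chosen here only for illustration.

```r
# Re-create the vector so this block is self-contained
my_vector <- c(10, 25, 30, 45, 50)

# Each statistic shown by summary(), computed individually
v_min    <- min(my_vector)      # smallest value
v_max    <- max(my_vector)      # largest value
v_median <- median(my_vector)   # middle value of the sorted data
v_mean   <- mean(my_vector)     # arithmetic average: 160 / 5
v_quart  <- quantile(my_vector, probs = c(0.25, 0.75))  # 1st and 3rd quartiles

print(c(Min = v_min, Median = v_median, Mean = v_mean, Max = v_max))
print(v_quart)
```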
# Creating a matrix with 2 rows and 5 columns
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)
print(my_matrix)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
# Summary of the matrix
summary(my_matrix)
##       V1             V2             V3             V4             V5
##  Min.   :1.00   Min.   :3.00   Min.   :5.00   Min.   :7.00   Min.   : 9.00
##  1st Qu.:1.25   1st Qu.:3.25   1st Qu.:5.25   1st Qu.:7.25   1st Qu.: 9.25
##  Median :1.50   Median :3.50   Median :5.50   Median :7.50   Median : 9.50
##  Mean   :1.50   Mean   :3.50   Mean   :5.50   Mean   :7.50   Mean   : 9.50
##  3rd Qu.:1.75   3rd Qu.:3.75   3rd Qu.:5.75   3rd Qu.:7.75   3rd Qu.: 9.75
##  Max.   :2.00   Max.   :4.00   Max.   :6.00   Max.   :8.00   Max.   :10.00
Interpretation: The summary() function treats each column of the matrix as a separate variable (labelled V1 to V5) and reports statistics for each one individually. Because our matrix has only 2 rows, every column summary is computed from just two values. For example, column V1 contains 1 and 2, giving a minimum of 1, a maximum of 2, and a mean of 1.5.
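The column-by-column behaviour described above can be checked directly with colMeans() and apply(). This is only a sketch of one way to do it; the names col_means and col_ranges are introduced here for illustration.

```r
# Re-create the matrix so this block is self-contained
# (matrix() fills values column by column)
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)

# Mean of each column, matching the "Mean" row of summary(my_matrix)
col_means <- colMeans(my_matrix)

# Minimum and maximum of each column: MARGIN = 2 applies range() per column,
# so row 1 holds the minima and row 2 holds the maxima
col_ranges <- apply(my_matrix, 2, range)

print(col_means)
print(col_ranges)
```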
A data frame is a table where each column can contain different types of data. Per the project requirements, this data frame consists of 10 rows and 5 columns.
# Creating a data frame with 10 rows and 5 columns
student_data <- data.frame(
ID = 1:10,
Marks = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
Attendance = c(90, 85, 95, 88, 80, 98, 92, 87, 94, 82),
Age = c(20, 21, 19, 22, 20, 21, 19, 20, 22, 21),
Gender = factor(rep(c("Male", "Female"), 5))
)
# Showing the data frame
print(student_data)
##    ID Marks Attendance Age Gender
## 1   1    85         90  20   Male
## 2   2    78         85  21 Female
## 3   3    92         95  19   Male
## 4   4    88         88  22 Female
## 5   5    76         80  20   Male
## 6   6    95         98  21 Female
## 7   7    89         92  19   Male
## 8   8    84         87  20 Female
## 9   9    91         94  22   Male
## 10 10    80         82  21 Female
# Summary of the data frame
summary(student_data)
##       ID            Marks        Attendance        Age         Gender
##  Min.   : 1.00   Min.   :76.0   Min.   :80.0   Min.   :19.0   Female:5
##  1st Qu.: 3.25   1st Qu.:81.0   1st Qu.:85.5   1st Qu.:20.0   Male  :5
##  Median : 5.50   Median :86.5   Median :89.0   Median :20.5
##  Mean   : 5.50   Mean   :85.8   Mean   :89.1   Mean   :20.5
##  3rd Qu.: 7.75   3rd Qu.:90.5   3rd Qu.:93.5   3rd Qu.:21.0
##  Max.   :10.00   Max.   :95.0   Max.   :98.0   Max.   :22.0
Interpretation of Summary: The summary() function calculates descriptive statistics for every column. For numeric columns like ‘Marks’, we see a mean of ~85.8. For categorical variables like ‘Gender’, it provides a count of 5 Males and 5 Females.
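Which kind of summary a column receives depends on its class. The sketch below re-creates the data frame, inspects the class of each column, and reproduces the Gender counts with table(); the group means from tapply() are an extra illustration that is not part of the original summary, and the names col_classes, gender_counts, and mean_marks_by_gender are assumptions made for this example.

```r
# Re-create the data frame so this block is self-contained
student_data <- data.frame(
  ID = 1:10,
  Marks = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
  Attendance = c(90, 85, 95, 88, 80, 98, 92, 87, 94, 82),
  Age = c(20, 21, 19, 22, 20, 21, 19, 20, 22, 21),
  Gender = factor(rep(c("Male", "Female"), 5))
)

# Class of every column: numeric columns get quantiles, factors get counts
col_classes <- sapply(student_data, class)
print(col_classes)

# The counts that summary() shows for the factor column
gender_counts <- table(student_data$Gender)
print(gender_counts)

# Extra illustration: average marks within each gender group
mean_marks_by_gender <- tapply(student_data$Marks, student_data$Gender, mean)
print(mean_marks_by_gender)
```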
To work with external data in Windows, we follow these steps:
1. Package Installation: We must install a package that can read the file type. For Excel files (.xls or .xlsx), we use install.packages("readxl"). Installation is needed only once.
2. Loading the Package: We use library(readxl) in every session to make its functions available.
3. Path Specification: In R, we write file paths with forward slashes (/) rather than the single backslashes (\) shown by Windows, because a backslash starts an escape sequence inside an R string.
4. Handling No-Header Files: If a file has no header row, we pass col_names = FALSE to read_excel(). Since our .xls files are “straight data,” we refer to the .txt description files to know which column represents which variable and assign names using colnames().
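The path and header steps above can be sketched without any external file. In the example below a temporary CSV stands in for the header-less Excel file, and base R's read.csv(header = FALSE) plays the role that col_names = FALSE plays in read_excel(). The folder name in example_path and the three data rows are illustrative assumptions, not part of the project files.

```r
# Build a path with forward slashes (the folder names are hypothetical)
example_path <- file.path("C:", "Users", "student", "data", "wage1.xls")
print(example_path)

# Stand-in for a header-less data file: a temporary CSV with three columns
# (illustrative values only)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("3.10,11,2", "3.24,12,22", "3.00,11,2"), tmp_file)

# header = FALSE tells R that the first row is data, not column names
raw_data <- read.csv(tmp_file, header = FALSE)
print(colnames(raw_data))   # placeholder names assigned by R

# Assign meaningful names, as one would after reading the description file
colnames(raw_data) <- c("wage", "educ", "exper")
print(raw_data)

# Clean up the temporary file
unlink(tmp_file)
```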
We are importing the wage1.xls data. Since it lacks headers, we use the descriptions provided in WAGE1_description.txt.
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.3
# Importing wage1.xls (No headers in file)
wage1_data <- read_excel("wage1.xls", col_names = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
# Assigning column names based on WAGE1_description.txt
colnames(wage1_data) <- c("wage", "educ", "exper", "tenure", "nonwhite", "female",
"married", "numdep", "smsa", "northcen", "south", "west",
"construc", "ndurman", "trcommpu", "trade", "services",
"profserv", "profocc", "clerocc", "servocc", "lwage",
"expersq", "tenursq")
# Summary for interpretation
summary(wage1_data$wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.530   3.330   4.650   5.896   6.880  24.980
# 1. Histogram (Distribution of Hourly Wages)
hist(wage1_data$wage, col="lightgreen", main="Histogram of Hourly Wage", xlab="Wage ($ per hour)")
# 2. Box Plot (Education Level by Gender)
boxplot(educ ~ female, data = wage1_data, col="plum",
main="Box Plot: Education Level by Gender",
xlab="Gender (0=Male, 1=Female)", ylab="Years of Education")
# 3. Scatter Plot (Education vs Wage)
plot(wage1_data$educ, wage1_data$wage, main="Scatter Plot: Education vs Wage",
xlab="Years of Education", ylab="Hourly Wage ($)", pch=19, col="darkblue")
# 4. Frequency Density (Density of Hourly Wage)
plot(density(wage1_data$wage), main="Frequency Density of Hourly Wage", col="darkred", lwd=2)
Interpretation of Summary and Graphs:
1. Summary Interpretation: The imported wage data ranges from $0.53 to $24.98 per hour. The mean wage ($5.90) is higher than the median ($4.65), indicating a positive (right) skew.
2. Histogram: Visually confirms that most workers earn lower wages, with a long tail on the right representing high earners.
3. Box Plot: Illustrates that median education levels are relatively consistent between genders in this specific sample.
4. Scatter Plot: Shows a clear positive association; as years of education increase, wages tend to rise.
5. Frequency Density: The estimated density curve shows that wages are most heavily concentrated around the $4-$5 per hour mark.
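The link between "mean above median" and a right skew can be checked numerically. The sketch below uses a small hypothetical set of wages (not the wage1 data) and a simple moment-based skewness coefficient; the names example_wage, is_right_skewed, and skew_coef are assumptions introduced for this illustration.

```r
# Hypothetical right-skewed hourly wages (NOT taken from wage1.xls)
example_wage <- c(2.5, 3.0, 3.3, 4.0, 4.6, 5.0, 6.9, 9.5, 25.0)

wage_mean   <- mean(example_wage)
wage_median <- median(example_wage)

# A single large value pulls the mean above the median
is_right_skewed <- wage_mean > wage_median

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5; positive means right skew
dev <- example_wage - wage_mean
skew_coef <- mean(dev^3) / mean(dev^2)^1.5

print(c(mean = wage_mean, median = wage_median, skewness = skew_coef))
print(is_right_skewed)
```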
We are now importing the hprice1.xls dataset. We use HPRICE1_description.txt to name the columns.
# Importing hprice1.xls (No headers in file)
hprice_data <- read_excel("hprice1.xls", col_names = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
# Assigning column names based on HPRICE1_description.txt
colnames(hprice_data) <- c("price", "assess", "bdrms", "lotsize", "sqrft",
"colonial", "lprice", "lassess", "llotsize", "lsqrft")
# 1. Correlation between Size (sqrft) and Price
correlation_value <- cor(hprice_data$sqrft, hprice_data$price)
print(paste("Correlation Coefficient:", round(correlation_value, 4)))
## [1] "Correlation Coefficient: 0.7879"
Interpretation: The correlation coefficient of 0.7879 indicates a strong, positive linear relationship between the size of the house (sqrft) and its price. As the square footage increases, the price of the house tends to increase significantly.
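The coefficient returned by cor() is the sample covariance divided by the product of the two sample standard deviations. The sketch below verifies this on a few hypothetical houses (the sizes and prices are made up for illustration and are not rows of hprice1.xls).

```r
# Hypothetical house sizes (square feet) and prices (NOT from hprice1.xls)
size  <- c(1500, 1800, 2100, 2400, 3000)
price <- c(200, 260, 270, 330, 420)

# Sample covariance computed from its definition
cov_xy <- sum((size - mean(size)) * (price - mean(price))) / (length(size) - 1)

# Pearson correlation: covariance scaled by both standard deviations
r_manual  <- cov_xy / (sd(size) * sd(price))
r_builtin <- cor(size, price)

print(c(manual = r_manual, builtin = r_builtin))
print(all.equal(r_manual, r_builtin))
```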
# 2. Simple Linear Regression Model
# price (Dependent) ~ sqrft (Independent)
simple_model <- lm(price ~ sqrft, data = hprice_data)
summary(simple_model)
##
## Call:
## lm(formula = price ~ sqrft, data = hprice_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -117.112  -36.348   -6.503   31.701  235.253
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.20414   24.74261   0.453    0.652
## sqrft        0.14021    0.01182  11.866   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.62 on 86 degrees of freedom
## Multiple R-squared: 0.6208, Adjusted R-squared: 0.6164
## F-statistic: 140.8 on 1 and 86 DF, p-value: < 2.2e-16
Interpretation of Results:
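1. Intercept & Slope: Price in this dataset is recorded in thousands of dollars. The intercept is ~11.20, which theoretically implies that a house with 0 square feet would cost about $11,200. It has no practical meaning on its own and is not statistically different from zero (p = 0.652), but it anchors the regression line. The slope for ‘sqrft’ is ~0.14, meaning that each additional square foot is associated with a price increase of ~$140.
2. Variance: The standard error of each coefficient is the square root of its estimated variance: 24.74 for the intercept (variance ~612.2) and 0.0118 for ‘sqrft’ (variance ~0.00014). The Residual Standard Error (63.62) estimates the standard deviation of the error term, so the estimated error variance is 63.62^2, or ~4047.5.
3. T-test & P-value: The t value for ‘sqrft’ is 11.866 and its p-value (< 2e-16) is extremely low, meaning ‘sqrft’ is a highly significant predictor.
4. Adj R-square: The Multiple R-squared is 0.6208, so roughly 62.1% of price variation is explained by size. The Adjusted R-squared, which penalizes for the number of predictors, is 0.6164.
5. F-test: The F-statistic (140.8 on 1 and 86 degrees of freedom, p-value < 2.2e-16) is significant, validating the overall model. With a single predictor, it equals the square of the t value for ‘sqrft’.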
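Several of the quantities in the output are linked by simple arithmetic. The sketch below re-derives them from the rounded values printed in summary(simple_model), so small differences from the printed output are due to rounding. It assumes that price is recorded in thousands of dollars and that n = 88 (86 residual degrees of freedom plus 2 estimated coefficients); the 2,000 square foot house is a hypothetical example.

```r
# Values copied from the printed summary(simple_model) output
b0 <- 11.20414; se_b0 <- 24.74261   # intercept and its standard error
b1 <- 0.14021;  se_b1 <- 0.01182    # slope for sqrft and its standard error
rse <- 63.62                        # residual standard error
r2  <- 0.6208                       # multiple R-squared
n   <- 88                           # assumption: 86 residual df + 2 coefficients

# Variances are the squares of the standard errors
var_b0  <- se_b0^2
var_b1  <- se_b1^2
var_err <- rse^2

# t value = estimate / standard error; with one predictor, F = t^2
t_b1   <- b1 / se_b1
f_stat <- t_b1^2

# Adjusted R-squared for a model with one predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)

# Predicted price (in $1000s) for a hypothetical 2,000 sq ft house
pred_2000 <- b0 + b1 * 2000

print(round(c(var_b0 = var_b0, var_b1 = var_b1, var_err = var_err), 5))
print(round(c(t = t_b1, F = f_stat, adj_r2 = adj_r2, pred_2000 = pred_2000), 4))
```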
# Plotting the regression model on the graph
plot(hprice_data$sqrft, hprice_data$price, main="Simple Regression: Price vs Size", pch=16)
abline(simple_model, col="red", lwd=2)
We expand the model by adding the Number of Bedrooms (bdrms).
# Multiple Regression Model: Price depends on Size and Bedrooms
multiple_model <- lm(price ~ sqrft + bdrms, data = hprice_data)
summary(multiple_model)
##
## Call:
## lm(formula = price ~ sqrft + bdrms, data = hprice_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -127.627  -42.876   -7.051   32.589  229.003
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.31500   31.04662  -0.622    0.536
## sqrft         0.12844    0.01382   9.291 1.39e-14 ***
## bdrms        15.19819    9.48352   1.603    0.113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.04 on 85 degrees of freedom
## Multiple R-squared: 0.6319, Adjusted R-squared: 0.6233
## F-statistic: 72.96 on 2 and 85 DF, p-value: < 2.2e-16