In this section, we demonstrate the creation and summary of basic data structures in R, including vectors, matrices, and data frames.
A vector is a basic data structure in R that holds a sequence of elements of the same type.
# Creating a vector of 5 numbers
my_vector <- c(10, 25, 30, 45, 50)
print(my_vector)
## [1] 10 25 30 45 50
# Summary of the vector
summary(my_vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 25 30 32 45 50
Interpretation: The summary reports the minimum (10), first quartile (25), median (30), mean (32), third quartile (45), and maximum (50) of our numbers. Together these describe the range and central tendency of the data.
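Because a vector holds a single type, R coerces mixed inputs to one common type. The sketch below illustrates that behaviour and reproduces the summary() figures with individual functions. The name mixed_vector is introduced here only for illustration.

```r
# Recreate the vector from the example above
my_vector <- c(10, 25, 30, 45, 50)

# Mixing types forces coercion to one common type (here: character)
mixed_vector <- c(1, "a", TRUE)  # hypothetical example vector
class(mixed_vector)

# The same statistics that summary() reports, computed one at a time
min(my_vector)
max(my_vector)
median(my_vector)
mean(my_vector)
quantile(my_vector, probs = c(0.25, 0.75))
```

Calling the individual functions is useful when only one statistic is needed, for example inside a larger calculation.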
A matrix is a collection of elements of the same type arranged in a two-dimensional layout of rows and columns. By default, matrix() fills the values column by column.
# Creating a matrix with 2 rows and 5 columns
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)
print(my_matrix)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
# Summary of the matrix
summary(my_matrix)
## V1 V2 V3 V4 V5
## Min. :1.00 Min. :3.00 Min. :5.00 Min. :7.00 Min. : 9.00
## 1st Qu.:1.25 1st Qu.:3.25 1st Qu.:5.25 1st Qu.:7.25 1st Qu.: 9.25
## Median :1.50 Median :3.50 Median :5.50 Median :7.50 Median : 9.50
## Mean :1.50 Mean :3.50 Mean :5.50 Mean :7.50 Mean : 9.50
## 3rd Qu.:1.75 3rd Qu.:3.75 3rd Qu.:5.75 3rd Qu.:7.75 3rd Qu.: 9.75
## Max. :2.00 Max. :4.00 Max. :6.00 Max. :8.00 Max. :10.00
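Interpretation: summary() treats each column of the matrix as a separate variable (labelled V1 to V5) and reports statistics for it. Because each column holds only two values, the mean and the median of every column are equal.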
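The following sketch shows a few other common ways to inspect a matrix: its dimensions, filling by row instead of by column, and row or column totals. The name row_matrix is an illustrative assumption, not part of the original example.

```r
# Recreate the matrix from the example above (filled column by column)
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)

# Number of rows and columns
dim(my_matrix)

# Same values, but filled row by row (hypothetical comparison matrix)
row_matrix <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
print(row_matrix)

# Column means and row sums
colMeans(my_matrix)
rowSums(my_matrix)

# Element in row 2, column 3
my_matrix[2, 3]
```

The column means returned by colMeans() match the Mean row of the summary() output above.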
A data frame is a table where each column can contain different types of data.
# Creating a data frame with 10 rows and 5 columns
student_data <- data.frame(
ID = 1:10,
Marks = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
Attendance = c(90, 85, 95, 88, 80, 98, 92, 87, 94, 82),
Age = c(20, 21, 19, 22, 20, 21, 19, 20, 22, 21),
Gender = factor(rep(c("Male", "Female"), 5))
)
# Showing the data frame
print(student_data)
## ID Marks Attendance Age Gender
## 1 1 85 90 20 Male
## 2 2 78 85 21 Female
## 3 3 92 95 19 Male
## 4 4 88 88 22 Female
## 5 5 76 80 20 Male
## 6 6 95 98 21 Female
## 7 7 89 92 19 Male
## 8 8 84 87 20 Female
## 9 9 91 94 22 Male
## 10 10 80 82 21 Female
# Summary of the data frame
summary(student_data)
## ID Marks Attendance Age Gender
## Min. : 1.00 Min. :76.0 Min. :80.0 Min. :19.0 Female:5
## 1st Qu.: 3.25 1st Qu.:81.0 1st Qu.:85.5 1st Qu.:20.0 Male :5
## Median : 5.50 Median :86.5 Median :89.0 Median :20.5
## Mean : 5.50 Mean :85.8 Mean :89.1 Mean :20.5
## 3rd Qu.: 7.75 3rd Qu.:90.5 3rd Qu.:93.5 3rd Qu.:21.0
## Max. :10.00 Max. :95.0 Max. :98.0 Max. :22.0
Interpretation of Summary: The summary() function calculates descriptive statistics for every column. For numeric variables like ‘Marks’, it shows the minimum, quartiles, median, mean, and maximum. For categorical (factor) variables like ‘Gender’, it shows the count of each category. Note that ‘ID’ is summarized as a number even though it is only an identifier.
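Beyond summary(), a data frame can be inspected by structure, summarized by group, and filtered by condition. The sketch below rebuilds the same student_data and shows one possible way to do each. The names group_means and top_students are introduced here for illustration.

```r
# Recreate the data frame from the example above
student_data <- data.frame(
  ID = 1:10,
  Marks = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
  Attendance = c(90, 85, 95, 88, 80, 98, 92, 87, 94, 82),
  Age = c(20, 21, 19, 22, 20, 21, 19, 20, 22, 21),
  Gender = factor(rep(c("Male", "Female"), 5))
)

# Column names, types, and first few values
str(student_data)

# Mean Marks for each Gender category
group_means <- tapply(student_data$Marks, student_data$Gender, mean)
print(group_means)

# Rows where Marks are above 90
top_students <- student_data[student_data$Marks > 90, ]
print(top_students)
```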
To work with external data in Windows, we use specific functions:
* CSV Files: We use read.csv("file_path.csv").
* Excel Files: We first install the readxl package using install.packages("readxl"), then load it with library(readxl), and read the file with read_excel("file_path.xlsx").
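As a minimal sketch of the CSV route, the example below writes a small data frame to a temporary file and reads it back, so it runs without any external file. The names sample_data, csv_path, and imported_data, as well as the Windows path in the final comment, are hypothetical.

```r
# Small example data frame (hypothetical values for illustration)
sample_data <- data.frame(ID = 1:3, Marks = c(85, 78, 92))

# Write it to a temporary CSV file, without row names
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_data, csv_path, row.names = FALSE)

# Read the file back into R
imported_data <- read.csv(csv_path)
str(imported_data)

# On Windows, use forward slashes or doubled backslashes in paths,
# because a single backslash starts an escape sequence in R strings:
# read.csv("C:/Users/your_name/Documents/file_path.csv")  # hypothetical path
```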
We will use the built-in mtcars dataset (Motor Trend Car Road Tests).
# Load the data
data(mtcars)
# 1. Histogram (Frequency of Miles Per Gallon)
hist(mtcars$mpg, col="skyblue", main="Histogram of MPG", xlab="Miles Per Gallon")
# 2. Box Plot (Range of Horsepower)
boxplot(mtcars$hp, col="orange", main="Box Plot of Horsepower", ylab="Horsepower")
# 3. Scatter Plot (MPG vs Weight)
plot(mtcars$wt, mtcars$mpg, main="Scatter Plot: MPG vs Weight",
xlab="Weight", ylab="MPG", pch=19, col="blue")
# 4. Frequency Density (Density of MPG)
plot(density(mtcars$mpg), main="Frequency Density of MPG", col="red", lwd=2)
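The histogram and the density curve can also be combined in one figure and saved to a file. The sketch below is one possible way to do this with base graphics. It writes to a temporary PDF, and the name plot_file is an assumption made for the example.

```r
# Load the data
data(mtcars)

# Open a PDF graphics device that writes to a temporary file
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# freq = FALSE puts the histogram on the density scale,
# so the density curve can be drawn on the same axes
hist(mtcars$mpg, freq = FALSE, col = "skyblue",
     main = "Histogram of MPG with Density Curve", xlab = "Miles Per Gallon")
lines(density(mtcars$mpg), col = "red", lwd = 2)

# Close the device so the file is written to disk
dev.off()
file.exists(plot_file)
```

Replacing tempfile() with a fixed file name would keep the figure in the working directory.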
We now examine how Horsepower (hp) is related to Miles Per Gallon (mpg), first with a correlation and then with a simple linear regression.
# 1. Correlation
correlation_value <- cor(mtcars$hp, mtcars$mpg)
print(paste("Correlation:", correlation_value))
## [1] "Correlation: -0.776168371826586"
# 2. Simple Linear Regression Model
# mpg (Dependent) ~ hp (Independent)
simple_model <- lm(mpg ~ hp, data = mtcars)
summary(simple_model)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
Interpretation:
* Intercept & Slope: The intercept (30.10) is the predicted MPG when HP is zero, which lies outside the observed data and so has little practical meaning on its own. The slope for ‘hp’ (-0.068) is negative: each additional unit of Horsepower is associated with a decrease of about 0.068 MPG.
* T-test & P-value: The p-value for ‘hp’ (1.79e-07) is far below 0.05, so Horsepower is a statistically significant predictor of MPG.
* Adj R-square: The adjusted R-square of 0.5892 means that Horsepower alone explains about 59% of the variation in MPG.
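The coefficients and R-square reported by lm() can be checked by hand. The sketch below uses the standard least-squares formulas for one predictor (slope = covariance / variance, intercept = mean of y minus slope times mean of x) and shows that, in simple regression, R-square equals the squared correlation. The names x, y, slope_manual, and intercept_manual are introduced here for illustration.

```r
# Load the data and pick the two variables
data(mtcars)
x <- mtcars$hp
y <- mtcars$mpg

# Least-squares estimates computed by hand
slope_manual <- cov(x, y) / var(x)
intercept_manual <- mean(y) - slope_manual * mean(x)
c(intercept_manual, slope_manual)

# The same estimates from lm()
simple_model <- lm(mpg ~ hp, data = mtcars)
coef(simple_model)

# With a single predictor, R-squared is the squared correlation
cor(x, y)^2
summary(simple_model)$r.squared
```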
# Plotting the regression model
plot(mtcars$hp, mtcars$mpg, main="Regression: MPG on Horsepower")
abline(simple_model, col="red", lwd=2)
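Once fitted, the model can be used to predict MPG for new Horsepower values, which are the points that lie on the red regression line. The sketch below shows one way to do this with predict(). The data frame new_cars and its hp values are hypothetical and were chosen to fall inside the range of the observed data.

```r
# Fit the simple model again so this example is self-contained
data(mtcars)
simple_model <- lm(mpg ~ hp, data = mtcars)

# Hypothetical cars with 100, 150, and 200 horsepower
new_cars <- data.frame(hp = c(100, 150, 200))

# Point predictions from the fitted line
predicted_mpg <- predict(simple_model, newdata = new_cars)
print(predicted_mpg)

# Predictions with 95% confidence intervals for the mean response
predict(simple_model, newdata = new_cars, interval = "confidence")
```

Predictions for Horsepower values far outside the observed range should be treated with caution, since the linear relationship has only been checked within the data.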
Now we add Weight (wt) as a second independent variable to see if the model improves.
# Multiple Regression Model
multiple_model <- lm(mpg ~ hp + wt, data = mtcars)
summary(multiple_model)
##
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
| Basis of Comparison | Simple Model (MPG ~ HP) | Multiple Model (MPG ~ HP + WT) |
|---|---|---|
| R-Square | ~0.60 | ~0.83 |
| Adjusted R-Square | ~0.59 | ~0.81 |
| F-test | F = 45.46, p = 1.788e-07 (significant) | F = 69.21, p = 9.109e-12 (significant) |
Interpretation: The Multiple Regression model is better because the Adjusted R-square increased substantially (from ~0.59 to ~0.81) and the residual standard error fell from 3.863 to 2.593. Adding ‘Weight’ therefore explains much more of the variation in car mileage than ‘Horsepower’ alone. Both variables have significant t-tests, as their p-values (0.00145 for ‘hp’ and 1.12e-06 for ‘wt’) are below 0.05.
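The comparison can also be produced directly in R. The sketch below collects the fit statistics of both models into one table and runs a nested-model F-test with anova(), which checks whether adding Weight gives a significant improvement. The names comparison and nested_test are assumptions made for this example.

```r
# Fit both models so this example is self-contained
data(mtcars)
simple_model <- lm(mpg ~ hp, data = mtcars)
multiple_model <- lm(mpg ~ hp + wt, data = mtcars)

# Collect the key fit statistics side by side
comparison <- data.frame(
  model = c("mpg ~ hp", "mpg ~ hp + wt"),
  r_squared = c(summary(simple_model)$r.squared,
                summary(multiple_model)$r.squared),
  adj_r_squared = c(summary(simple_model)$adj.r.squared,
                    summary(multiple_model)$adj.r.squared),
  residual_se = c(summary(simple_model)$sigma,
                  summary(multiple_model)$sigma)
)
print(comparison)

# F-test for the nested models: does adding wt improve the fit?
nested_test <- anova(simple_model, multiple_model)
print(nested_test)
```

A small p-value in the anova() table indicates that the larger model fits significantly better than the smaller one.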