Part (a): Basic R Operations

In this section, we demonstrate the creation and summary of basic data structures in R, including vectors, matrices, and data frames.

1. Vector and its Summary

A vector is a basic data structure in R that holds a sequence of elements of the same type.

# Creating a vector of 5 numbers
my_vector <- c(10, 25, 30, 45, 50)
print(my_vector)
## [1] 10 25 30 45 50
# Summary of the vector
summary(my_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10      25      30      32      45      50

Interpretation: The summary reports the minimum (10), first quartile (25), median (30), mean (32), third quartile (45), and maximum (50) of our numbers. Together these describe the range and central tendency of our data.
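Each value reported by summary() can also be computed on its own, which makes clear where the numbers come from. The following is a minimal sketch using base R only; the name vector_stats is introduced here for illustration and is not part of the original example.

```r
# Recreate the vector from the example above
my_vector <- c(10, 25, 30, 45, 50)

# Compute each summary statistic individually with base R functions
vector_stats <- c(
  Min    = min(my_vector),
  Q1     = unname(quantile(my_vector, 0.25)),
  Median = median(my_vector),
  Mean   = mean(my_vector),
  Q3     = unname(quantile(my_vector, 0.75)),
  Max    = max(my_vector)
)
print(vector_stats)

# Spread of the data: largest value minus smallest value
value_range <- diff(range(my_vector))
print(value_range)
```

The quartiles use R's default quantile method, which is the same one summary() applies.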

2. Matrix and its Summary

A matrix is a collection of data elements of the same type arranged in a two-dimensional rectangular layout.

# Creating a matrix with 2 rows and 5 columns
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)
print(my_matrix)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
# Summary of the matrix
summary(my_matrix)
##        V1             V2             V3             V4             V5       
##  Min.   :1.00   Min.   :3.00   Min.   :5.00   Min.   :7.00   Min.   : 9.00  
##  1st Qu.:1.25   1st Qu.:3.25   1st Qu.:5.25   1st Qu.:7.25   1st Qu.: 9.25  
##  Median :1.50   Median :3.50   Median :5.50   Median :7.50   Median : 9.50  
##  Mean   :1.50   Mean   :3.50   Mean   :5.50   Mean   :7.50   Mean   : 9.50  
##  3rd Qu.:1.75   3rd Qu.:3.75   3rd Qu.:5.75   3rd Qu.:7.75   3rd Qu.: 9.75  
##  Max.   :2.00   Max.   :4.00   Max.   :6.00   Max.   :8.00   Max.   :10.00

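Interpretation: summary() treats each column of the matrix as a separate variable (labelled V1 to V5) and reports the same six statistics for each. Because every column holds only two consecutive integers, the mean and median of each column coincide.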
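By default, matrix() fills its values column by column, which is why the first column above holds 1 and 2. The sketch below contrasts this with row-wise filling and checks the column means against the summary; by_row_matrix is a name chosen here for illustration.

```r
# Default behaviour: the values 1..10 fill the matrix column by column
my_matrix <- matrix(1:10, nrow = 2, ncol = 5)

# byrow = TRUE fills the matrix row by row instead
by_row_matrix <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
print(by_row_matrix)

# Column means match the "Mean" row of summary(my_matrix)
column_means <- colMeans(my_matrix)
print(column_means)

# Dimensions: number of rows, then number of columns
dim(my_matrix)
```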

3. Data Frame and its Summary

A data frame is a table in which each column holds a single type of data, but different columns can hold different types.

# Creating a data frame with 10 rows and 5 columns
student_data <- data.frame(
  ID = 1:10,
  Marks = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
  Attendance = c(90, 85, 95, 88, 80, 98, 92, 87, 94, 82),
  Age = c(20, 21, 19, 22, 20, 21, 19, 20, 22, 21),
  Gender = factor(rep(c("Male", "Female"), 5))
)

# Showing the data frame
print(student_data)
##    ID Marks Attendance Age Gender
## 1   1    85         90  20   Male
## 2   2    78         85  21 Female
## 3   3    92         95  19   Male
## 4   4    88         88  22 Female
## 5   5    76         80  20   Male
## 6   6    95         98  21 Female
## 7   7    89         92  19   Male
## 8   8    84         87  20 Female
## 9   9    91         94  22   Male
## 10 10    80         82  21 Female
# Summary of the data frame
summary(student_data)
##        ID            Marks        Attendance        Age          Gender 
##  Min.   : 1.00   Min.   :76.0   Min.   :80.0   Min.   :19.0   Female:5  
##  1st Qu.: 3.25   1st Qu.:81.0   1st Qu.:85.5   1st Qu.:20.0   Male  :5  
##  Median : 5.50   Median :86.5   Median :89.0   Median :20.5             
##  Mean   : 5.50   Mean   :85.8   Mean   :89.1   Mean   :20.5             
##  3rd Qu.: 7.75   3rd Qu.:90.5   3rd Qu.:93.5   3rd Qu.:21.0             
##  Max.   :10.00   Max.   :95.0   Max.   :98.0   Max.   :22.0

Interpretation of Summary: The summary() function calculates descriptive statistics for every column. For numeric variables like ‘Marks’, it shows the minimum, quartiles, median, mean, and maximum. For categorical (factor) variables like ‘Gender’, it shows the count of each category (here, 5 Female and 5 Male).
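The way summary() treats a column depends on how that column is stored. The following sketch, using base R only, inspects the column types and then summarises Marks within each Gender group; marks_by_gender is a name introduced here for illustration.

```r
# Recreate the data frame from the example above
student_data <- data.frame(
  ID = 1:10,
  Marks = c(85, 78, 92, 88, 76, 95, 89, 84, 91, 80),
  Attendance = c(90, 85, 95, 88, 80, 98, 92, 87, 94, 82),
  Age = c(20, 21, 19, 22, 20, 21, 19, 20, 22, 21),
  Gender = factor(rep(c("Male", "Female"), 5))
)

# Storage class of each column; this decides how summary() reports it
column_classes <- sapply(student_data, class)
print(column_classes)

# Count of students in each Gender category
table(student_data$Gender)

# Mean Marks within each Gender group
marks_by_gender <- tapply(student_data$Marks, student_data$Gender, mean)
print(marks_by_gender)
```

Note that ID is stored as a number, so summary() reports quartiles for it even though it is only an identifier.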

Part (b): Data Importing and Visualization

1. Process of Importing Data

To work with external data in Windows, we use the following functions:

* CSV Files: We use read.csv("file_path.csv").
* Excel Files: We first install the readxl package once using install.packages("readxl"), then load it with library(readxl) and read the file with read_excel("file_path.xlsx").
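The import steps can be tried without any external file by writing a small data frame to a temporary CSV and reading it back. This is a minimal sketch: example_data, csv_path, and imported_data are illustrative names, and the Excel part runs only if readxl is already installed, using a sample workbook that ships with that package.

```r
# Write a small example data frame to a temporary CSV file
example_data <- data.frame(ID = 1:3, Marks = c(85, 78, 92))
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Read it back; read.csv() returns a data frame
imported_data <- read.csv(csv_path)
str(imported_data)

# On Windows, write paths with forward slashes or doubled backslashes,
# e.g. "C:/Users/name/data.csv" (a hypothetical path, not a real file)

# Excel import, attempted only if the readxl package is installed
if (requireNamespace("readxl", quietly = TRUE)) {
  # readxl_example() returns the path of a sample workbook bundled with readxl
  excel_path <- readxl::readxl_example("datasets.xlsx")
  excel_data <- readxl::read_excel(excel_path)
  print(head(excel_data))
}
```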

2. Visualizations using Car Data

We will use the built-in mtcars dataset (Motor Trend Car Road Tests).

# Load the data
data(mtcars)

# 1. Histogram (Frequency of Miles Per Gallon)
hist(mtcars$mpg, col="skyblue", main="Histogram of MPG", xlab="Miles Per Gallon")

# 2. Box Plot (Range of Horsepower)
boxplot(mtcars$hp, col="orange", main="Box Plot of Horsepower", ylab="Horsepower")

# 3. Scatter Plot (MPG vs Weight)
plot(mtcars$wt, mtcars$mpg, main="Scatter Plot: MPG vs Weight", 
     xlab="Weight (1000 lbs)", ylab="MPG", pch=19, col="blue")

# 4. Frequency Density (Density of MPG)
plot(density(mtcars$mpg), main="Frequency Density of MPG", col="red", lwd=2)
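For a report, it is often convenient to place the four plots on one page and save them to a file. The sketch below is one way to do this with base graphics; the temporary file path and the 2 x 2 layout are choices made here for illustration.

```r
# Open a PDF graphics device that writes to a temporary file
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path, width = 8, height = 8)

# Arrange the next four plots in a 2 x 2 grid
par(mfrow = c(2, 2))
hist(mtcars$mpg, col = "skyblue", main = "Histogram of MPG",
     xlab = "Miles Per Gallon")
boxplot(mtcars$hp, col = "orange", main = "Box Plot of Horsepower",
        ylab = "Horsepower")
plot(mtcars$wt, mtcars$mpg, main = "MPG vs Weight",
     xlab = "Weight (1000 lbs)", ylab = "MPG", pch = 19, col = "blue")
plot(density(mtcars$mpg), main = "Density of MPG", col = "red", lwd = 2)

# Close the device so that the file is written to disk
dev.off()
file.exists(plot_path)
```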

Part (c): Correlation and Simple Linear Regression

We will check how Horsepower (hp) is related to Miles Per Gallon (mpg).

# 1. Correlation
correlation_value <- cor(mtcars$hp, mtcars$mpg)
print(paste("Correlation:", correlation_value))
## [1] "Correlation: -0.776168371826586"
# 2. Simple Linear Regression Model
# mpg (Dependent) ~ hp (Independent)
simple_model <- lm(mpg ~ hp, data = mtcars)
summary(simple_model)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Interpretation:

* Intercept & Slope: The intercept (30.10) is the predicted MPG when HP is zero, which lies outside the observed data and is not meaningful on its own. The slope for ‘hp’ is negative (-0.068), meaning each additional unit of Horsepower is associated with a decrease of about 0.068 MPG.
* T-test & P-value: The p-value for ‘hp’ (1.79e-07) is far below 0.05, suggesting that Horsepower is a significant predictor.
* Adj R-square: At 0.5892, it indicates that about 59% of the variation in MPG is explained by Horsepower alone, after adjusting for the number of predictors.
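The quantities discussed above can be pulled out of the fitted model directly. The following sketch refits the same model and uses it for a prediction; the 150-horsepower car in new_car is a hypothetical input, not a row of mtcars.

```r
# Refit the simple regression from this section
simple_model <- lm(mpg ~ hp, data = mtcars)

# Extract the fitted intercept and slope
coef(simple_model)

# With a single predictor, R-squared equals the squared correlation
r_squared <- summary(simple_model)$r.squared
squared_correlation <- cor(mtcars$hp, mtcars$mpg)^2
all.equal(squared_correlation, r_squared)

# Predicted MPG for a hypothetical car with 150 horsepower
new_car <- data.frame(hp = 150)
predicted_mpg <- predict(simple_model, newdata = new_car)
print(predicted_mpg)

# 95% confidence intervals for the intercept and slope
confint(simple_model)
```

The prediction is simply intercept + slope * 150, so it falls close to 19.9 MPG with the coefficients shown in the output above.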

# Plotting the regression model
plot(mtcars$hp, mtcars$mpg, main="Regression: MPG on Horsepower",
     xlab="Horsepower", ylab="MPG")
abline(simple_model, col="red", lwd=2)

Part (d): Multiple Regression Comparison

Now we add Weight (wt) as a second independent variable to see if the model improves.

# Multiple Regression Model
multiple_model <- lm(mpg ~ hp + wt, data = mtcars)
summary(multiple_model)
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Comparison Table

| Basis of Comparison | Simple Model (MPG ~ HP) | Multiple Model (MPG ~ HP + WT) |
|---|---|---|
| R-Square | ~0.60 | ~0.83 |
| Adjusted R-Square | ~0.59 | ~0.81 |
| F-test | Significant (F = 45.46, p = 1.79e-07) | Significant (F = 69.21, p = 9.11e-12) |

Interpretation: The Multiple Regression model is better because the Adjusted R-square increased substantially (from ~0.59 to ~0.81). This means that Horsepower and Weight together explain much more of the variation in car mileage than Horsepower alone. Both variables have significant T-tests, as their p-values (0.00145 for ‘hp’ and 1.12e-06 for ‘wt’) are below 0.05.
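Because the simple model is nested inside the multiple model, the two can also be compared formally. The sketch below shows one common approach, a partial F-test with anova() alongside AIC(); it is offered as a supplementary check and is not part of the original analysis.

```r
# Refit both models from the sections above
simple_model   <- lm(mpg ~ hp, data = mtcars)
multiple_model <- lm(mpg ~ hp + wt, data = mtcars)

# Partial F-test: does adding wt significantly reduce the residual error?
model_comparison <- anova(simple_model, multiple_model)
print(model_comparison)

# Adjusted R-squared of each model, side by side
adjusted_r2 <- c(
  simple   = summary(simple_model)$adj.r.squared,
  multiple = summary(multiple_model)$adj.r.squared
)
print(round(adjusted_r2, 4))

# AIC: a lower value indicates a better balance of fit and complexity
model_aic <- AIC(simple_model, multiple_model)
print(model_aic)
```

A small p-value in the second row of the anova() table indicates that Weight adds explanatory power beyond Horsepower.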