Q1

Write down all codes with the interpretation of the result. (Vector, Matrix, Summary of Vector and Matrix, make data frame (with 10 rows and 5 Columns), Summary of data frame, Interpretation of Summary)

Create a numeric vector

my_vector <- c(5, 10, 15, 20, 25)

# Print the vector
print(my_vector)

## [1]  5 10 15 20 25

Interpretation: The numeric vector contains a sequence of evenly spaced values increasing by 5, representing a simple linear pattern.
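The "increasing by 5" reading can be checked directly. The sketch below is only illustrative; `step_sizes` and `generated` are names introduced here and are not part of the original code.

```r
my_vector <- c(5, 10, 15, 20, 25)

# Differences between neighbouring elements; a constant value means even spacing
step_sizes <- diff(my_vector)
print(step_sizes)

# The same values built from a start, an end, and a step
generated <- seq(from = 5, to = 25, by = 5)
print(isTRUE(all.equal(generated, my_vector)))
```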

Create a Matrix

my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

# Print the matrix
print(my_matrix)

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Interpretation: This is a 2x3 matrix filled column-wise with integers from 1 to 6.
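To make the column-wise fill concrete, the following sketch contrasts the default behaviour with `byrow = TRUE`. The object names `by_col` and `by_row` are introduced here for illustration only.

```r
values <- c(1, 2, 3, 4, 5, 6)

# Default: values run down each column first
by_col <- matrix(values, nrow = 2, ncol = 3)
print(by_col)

# byrow = TRUE: values run across each row first
by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)

# Both have the same shape; only the arrangement differs
print(dim(by_col))
```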

Summarize Vector and Matrix

# Summary statistics of the vector
summary(my_vector)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5      10      15      15      20      25

Interpretation: The vector has evenly spaced values, so the distribution is symmetric and the mean and median are equal (both 15).


# Summary of the matrix (each column treated as a vector)
summary(my_matrix)

##        V1             V2             V3      
##  Min.   :1.00   Min.   :3.00   Min.   :5.00  
##  1st Qu.:1.25   1st Qu.:3.25   1st Qu.:5.25  
##  Median :1.50   Median :3.50   Median :5.50  
##  Mean   :1.50   Mean   :3.50   Mean   :5.50  
##  3rd Qu.:1.75   3rd Qu.:3.75   3rd Qu.:5.75  
##  Max.   :2.00   Max.   :4.00   Max.   :6.00

Interpretation: Each column holds only two values, so its mean and median coincide by construction (1.5, 3.5, and 5.5). The columns have the same spread and differ only in location, shifting up by 2 from one column to the next.
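The per-column means and medians reported by `summary()` can also be computed individually. This is a minimal sketch; `col_means` and `col_medians` are illustrative names, not part of the original code.

```r
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

# Mean of each column
col_means <- colMeans(my_matrix)
print(col_means)

# Median of each column (MARGIN = 2 applies the function column by column)
col_medians <- apply(my_matrix, 2, median)
print(col_medians)

# With two values per column, the mean and median are necessarily the same
print(isTRUE(all.equal(col_means, col_medians)))
```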

Create a data frame

df <- data.frame(
  ID = 1:10,
  Age = c(21, 23, 22, 24, 25, 21, 22, 23, 24, 22),
  Score = c(88, 75, 90, 85, 92, 79, 80, 84, 87, 91),
  Passed = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  Gender = c("F", "M", "F", "M", "F", "M", "F", "F", "M", "F")
)

Print the data frame

print(df)

##    ID Age Score Passed Gender
## 1   1  21    88   TRUE      F
## 2   2  23    75  FALSE      M
## 3   3  22    90   TRUE      F
## 4   4  24    85   TRUE      M
## 5   5  25    92   TRUE      F
## 6   6  21    79  FALSE      M
## 7   7  22    80   TRUE      F
## 8   8  23    84   TRUE      F
## 9   9  24    87   TRUE      M
## 10 10  22    91   TRUE      F

Data Frame Interpretation: This dataset contains information on 10 individuals with their ID, Age, Score, Pass/Fail status, and Gender. It includes numeric (ID, Age, Score), logical (Passed), and character (Gender) columns.

Summary of the data frame

summary(df)

##        ID             Age            Score        Passed       
##  Min.   : 1.00   Min.   :21.00   Min.   :75.0   Mode :logical  
##  1st Qu.: 3.25   1st Qu.:22.00   1st Qu.:81.0   FALSE:2        
##  Median : 5.50   Median :22.50   Median :86.0   TRUE :8        
##  Mean   : 5.50   Mean   :22.70   Mean   :85.1                  
##  3rd Qu.: 7.75   3rd Qu.:23.75   3rd Qu.:89.5                  
##  Max.   :10.00   Max.   :25.00   Max.   :92.0                  
##     Gender         
##  Length:10         
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Interpretation of the summary:

Age: Ranges from 21 to 25 with a mean of 22.7, so all individuals are in their early twenties.

Score: Ranges from 75 to 92 with a mean of 85.1 and a median of 86, showing generally good performance.

Passed: 8 out of 10 students passed, indicating a high pass rate.

Gender: summary() reports only the length and class for this character column. The printed data show both male and female students, with slightly more females (6 F, 4 M).
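Because `summary()` does not count the values of a character column, the gender split and pass rate can be checked separately. The sketch below rebuilds the same data frame; `gender_counts` and `pass_rate` are names introduced here for illustration.

```r
df <- data.frame(
  ID = 1:10,
  Age = c(21, 23, 22, 24, 25, 21, 22, 23, 24, 22),
  Score = c(88, 75, 90, 85, 92, 79, 80, 84, 87, 91),
  Passed = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  Gender = c("F", "M", "F", "M", "F", "M", "F", "F", "M", "F")
)

# Counts per category, which summary() does not report for character columns
gender_counts <- table(df$Gender)
print(gender_counts)

# The mean of a logical column is the share of TRUE values
pass_rate <- mean(df$Passed)
print(pass_rate)

# Converting to a factor makes summary() show the counts directly
df$Gender <- factor(df$Gender)
print(summary(df$Gender))
```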

Q2

Explain the process of importing data from Windows (CSV file, Excel file) and packages. Interpret imported data with a summary explained according to graphs. Plot histogram, box plot, 3d plot, and frequency density (For Plotting, use data either a CSV file or an Excel file).

Install these packages

#install.packages("readr")
#install.packages("readxl")

library(readr) # For read_csv
library(readxl) # For read_excel

Import Data from CSV or Excel

library(readxl)

# Replace with your actual file path
data <- read_excel("C:/Users/lenovo/OneDrive/Desktop/excel data/project_data.xlsx")

# View the first few rows
head(data)

## # A tibble: 5 × 5
##   Name   Favorite_Movie Genre     Year Rating
##   <chr>  <chr>          <chr>    <dbl>  <dbl>
## 1 Azra   Little Women   Drama     2019    9.2
## 2 Ayaan  Interstellar   Sci-Fi    2014    9.5
## 3 Fatima Queen          Comedy    2013    8.8
## 4 Zoya   The Notebook   Romance   2004    8.5
## 5 Rehan  Inception      Thriller  2010    9.4
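The import above uses an Excel file; a CSV file follows the same pattern. The sketch below is self-contained: it writes the five rows shown above to a temporary CSV file and reads them back with base R's `read.csv()`. The temporary path and the names `movies`, `csv_path`, and `csv_data` are assumptions for illustration. With the readr package loaded, `read_csv()` could be used on a real file path in the same way.

```r
# The five rows shown above, re-entered by hand for illustration
movies <- data.frame(
  Name = c("Azra", "Ayaan", "Fatima", "Zoya", "Rehan"),
  Favorite_Movie = c("Little Women", "Interstellar", "Queen",
                     "The Notebook", "Inception"),
  Genre = c("Drama", "Sci-Fi", "Comedy", "Romance", "Thriller"),
  Year = c(2019, 2014, 2013, 2004, 2010),
  Rating = c(9.2, 9.5, 8.8, 8.5, 9.4),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file (stands in for a file saved on disk)
csv_path <- tempfile(fileext = ".csv")
write.csv(movies, csv_path, row.names = FALSE)

# Import the CSV file and inspect it
csv_data <- read.csv(csv_path, stringsAsFactors = FALSE)
print(head(csv_data))
str(csv_data)
```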

Summary of data

# View structure of the dataset
str(data)

## tibble [5 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Name          : chr [1:5] "Azra" "Ayaan" "Fatima" "Zoya" ...
##  $ Favorite_Movie: chr [1:5] "Little Women" "Interstellar" "Queen" "The Notebook" ...
##  $ Genre         : chr [1:5] "Drama" "Sci-Fi" "Comedy" "Romance" ...
##  $ Year          : num [1:5] 2019 2014 2013 2004 2010
##  $ Rating        : num [1:5] 9.2 9.5 8.8 8.5 9.4

# Summary of all variables (character columns report only length and class)
summary(data)

##      Name           Favorite_Movie        Genre                Year     
##  Length:5           Length:5           Length:5           Min.   :2004  
##  Class :character   Class :character   Class :character   1st Qu.:2010  
##  Mode  :character   Mode  :character   Mode  :character   Median :2013  
##                                                           Mean   :2012  
##                                                           3rd Qu.:2014  
##                                                           Max.   :2019  
##      Rating    
##  Min.   :8.50  
##  1st Qu.:8.80  
##  Median :9.20  
##  Mean   :9.08  
##  3rd Qu.:9.40  
##  Max.   :9.50

Interpretation of Summary:

Name, Favorite_Movie, Genre: These are text (character) variables with 5 entries each, representing individual preferences.

Year: Movies range from 2004 to 2019. The average release year is around 2012, and the middle half of the movies were released between 2010 and 2014.

Rating: Ratings range from 8.5 to 9.5, showing that all selected movies are highly rated. The median rating is 9.2 and the mean is 9.08, indicating consistently high preferences. The 3rd quartile is 9.4, so the top 25% of ratings are at or above 9.4.
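The quartiles quoted for Rating can be reproduced from the five values in the imported data. This is a small sketch with the ratings typed in by hand; `ratings` and `rating_quartiles` are illustrative names.

```r
# Ratings as they appear in the imported data
ratings <- c(9.2, 9.5, 8.8, 8.5, 9.4)

# Minimum, quartiles, and maximum (R's default quantile method)
rating_quartiles <- quantile(ratings)
print(rating_quartiles)

# Mean and median for comparison
print(mean(ratings))
print(median(ratings))
```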

#install.packages("ggplot2")
#install.packages("plotly")

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Histogram

ggplot(data, aes(x = Rating)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Ratings", x = "Rating", y = "Frequency")

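Interpretation: Four of the five ratings fall between 8.8 and 9.5, showing a concentration of high preferences. The distribution is slightly left-skewed: the mean (9.08) lies below the median (9.2) because of the lowest rating, 8.5. With only five observations, the shape is indicative only.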

Box Plot

ggplot(data, aes(y = Rating)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Boxplot of Ratings", y = "Rating")

Interpretation: The median rating is 9.2, so half of the movies are rated at or above this value. There are no outliers, and the ratings are tightly clustered (the box spans roughly 8.8 to 9.4), indicating consistently high choices.
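The absence of outliers can be confirmed numerically with base R's `boxplot.stats()`, which applies the usual 1.5 x IQR rule. The sketch re-enters the ratings by hand; `box_info` is a name introduced here.

```r
ratings <- c(9.2, 9.5, 8.8, 8.5, 9.4)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   values beyond 1.5 times the box length from the hinges
box_info <- boxplot.stats(ratings)
print(box_info$stats)
print(box_info$out)

# An empty 'out' vector means no point is flagged as an outlier
print(length(box_info$out) == 0)
```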

3D plot: Year (x), Rating (y), Name (z)

plot_ly(data = data,
        x = ~Year,           # Movie release year
        y = ~Rating,         # Movie rating
        z = ~Name,           # Person's name
        type = "scatter3d",
        mode = "markers",
        marker = list(size = 5, color = ~Rating, colorscale = "Viridis")) %>%
  layout(title = "3D Plot: Year vs Rating vs Name")

Interpretation: This 3D plot shows how different individuals (Name axis) have rated their favorite movies released in various years. Because Name is categorical, the z-axis only separates the individuals and carries no numeric meaning. We observe: Four of the five favorite movies were released in 2010 or later. All ratings are high, at 8.5 or above. Each person has a unique movie preference across genres with strong ratings.

Frequency Density Plot

# Load ggplot2 package
library(ggplot2)

# Create the density plot
ggplot(data, aes(x = Rating)) +
  geom_density(fill = "lightblue", alpha = 0.6, color = "darkblue") +
  labs(title = "Frequency Density Plot of Movie Ratings",
       x = "Rating",
       y = "Density") +
  theme_minimal()

Interpretation: This density plot shows that: Most movie ratings are concentrated between 8.8 and 9.5. The distribution is slightly left-skewed, with fewer movies rated below 9. It reflects that everyone in the dataset chose movies they really liked, with no low-rated entries.
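The direction of the skew can be checked with a simple moment-based skewness coefficient. This is a sketch using the five ratings typed in by hand; the formula is the plain (uncorrected) sample version, and `skewness_value` is an illustrative name. With only five points the result should be read as a direction, not a precise measure.

```r
ratings <- c(9.2, 9.5, 8.8, 8.5, 9.4)

# Deviations from the mean
deviations <- ratings - mean(ratings)

# Third central moment divided by the cubed standard deviation (population form)
skewness_value <- mean(deviations^3) / mean(deviations^2)^1.5
print(skewness_value)

# A negative value, like a mean below the median, points to a left skew
print(skewness_value < 0)
print(mean(ratings) < median(ratings))
```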

Q3

Take other imported data and do correlation with interpretation, simple linear regression with interpretation (Check and interpret: Intercept and Slope, variance of parameter and error term, t-test, p-value, Adj R square, F test). Plot regression model on the graph.

Load required packages

library(readxl)

# Import the data from the Excel file
data <- read_excel("C:/Users/lenovo/OneDrive/Desktop/excel data/study_hour.xlsx")

Correlation Analysis

Calculate the correlation

correlation <- cor(data$Hours_Studied, data$Exam_Score)
print(correlation)

## [1] 0.967746

Interpretation: The correlation coefficient of 0.968 indicates a strong positive linear relationship between hours studied and exam scores: as study hours increase, exam scores tend to increase as well.
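In a regression with a single predictor, the squared correlation equals the model's R-squared. The short sketch below uses the printed correlation value, so it depends only on that reported figure; `r_value` and `r_squared` are illustrative names.

```r
# Correlation as printed above
r_value <- 0.967746

# Share of the variation in Exam_Score associated with Hours_Studied
r_squared <- r_value^2
print(round(r_squared, 4))
```

The result agrees with the Multiple R-squared of 0.9365 shown in the regression summary for the same two variables.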

Simple Linear Regression

Fit a simple linear regression model

model <- lm(Exam_Score ~ Hours_Studied, data = data)

# Print the model summary
summary(model)

## 
## Call:
## lm(formula = Exam_Score ~ Hours_Studied, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.503 -3.911 -1.800  4.386  7.327 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     53.333      3.475   15.35 3.23e-07 ***
## Hours_Studied    3.042      0.280   10.87 4.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.087 on 8 degrees of freedom
## Multiple R-squared:  0.9365, Adjusted R-squared:  0.9286 
## F-statistic:   118 on 1 and 8 DF,  p-value: 4.554e-06

Interpretation of the Regression Model Summary:

Intercept (53.333): When Hours_Studied is 0, the model predicts an exam score of 53.33. This is the starting point on the exam score scale.

Slope (3.042): For each additional hour studied, the predicted exam score increases by 3.042 points on average. The positive sign shows that more study time is associated with higher exam performance.

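Significance of Parameters (t-test): The t values are 15.35 for the Intercept and 10.87 for Hours_Studied, each being the estimate divided by its standard error (3.475 and 0.280). The corresponding p-values (3.23e-07 and 4.55e-06) are far below 0.05, so both parameters are significantly different from zero.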
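The t values, p-values, and variances can be recomputed from the printed coefficient table. The sketch below uses the rounded figures from the summary output, so the results match the printed ones only approximately. All object names are introduced here for illustration.

```r
# Estimates and standard errors as printed in the coefficient table
estimates  <- c(Intercept = 53.333, Hours_Studied = 3.042)
std_errors <- c(Intercept = 3.475,  Hours_Studied = 0.280)
df_resid   <- 8   # residual degrees of freedom from the summary

# t value = estimate / standard error
t_values <- estimates / std_errors
print(round(t_values, 2))

# Two-sided p-values from the t distribution
p_values <- 2 * pt(-abs(t_values), df = df_resid)
print(p_values)

# Variance of each parameter estimate = squared standard error
param_variances <- std_errors^2
print(param_variances)

# Estimated variance of the error term = squared residual standard error
error_variance <- 5.087^2
print(error_variance)
```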

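Residual Analysis: The residuals (differences between observed and predicted values) range from -5.503 to 7.327. The median residual (-1.800) is below zero, so at least half of the observations lie below the fitted line, but the deviation is small relative to the score scale. The residual standard error is 5.087 on 8 degrees of freedom, meaning observed scores typically differ from the fitted line by about 5 points.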

Model Fit: Multiple R-squared (0.9365): About 93.65% of the variation in exam scores is explained by the number of hours studied.

Adjusted R-squared (0.9286): After adjusting for the number of predictors (only one in this case), 92.86% of the variation is still explained by study hours, indicating a good model fit.

F-statistic (118): The high F-statistic and a very small p-value (4.554e-06) show that the model is statistically significant, and the independent variable (Hours_Studied) is a good predictor of exam scores.
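The same quantities can be pulled out of a fitted model object instead of being read off the printout. Because the Excel file is not available here, the sketch below uses R's built-in `cars` dataset purely as a stand-in; the extractor functions work the same way on any `lm` fit. The names `fit`, `coef_table`, `param_var`, and `error_var` are illustrative.

```r
# Stand-in model on a built-in dataset (NOT the study-hours data)
fit <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)
print(coef_table)

# Variances of the parameter estimates (diagonal of the covariance matrix)
param_var <- diag(vcov(fit))
print(param_var)

# Estimated variance of the error term (squared residual standard error)
error_var <- sigma(fit)^2
print(error_var)

# With one predictor, the F-statistic equals the squared slope t value
t_slope <- coef_table["speed", "t value"]
f_value <- fit_summary$fstatistic[["value"]]
print(c(t_squared = t_slope^2, F = f_value))

# R-squared and adjusted R-squared
print(c(fit_summary$r.squared, fit_summary$adj.r.squared))
```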

Plot the Regression Model

Plot the data and the regression line

ggplot(data, aes(x = Hours_Studied, y = Exam_Score)) +
  geom_point() + 
  geom_smooth(method = "lm", col = "red") + 
  labs(title = "Regression Line: Hours Studied vs Exam Score",
       x = "Hours Studied",
       y = "Exam Score")

## `geom_smooth()` using formula = 'y ~ x'

Q4

Take the multiple regression model and compare and interpret with a simple Linear regression model (basis of comparison: Intercept and Slope, variance of parameter and error term, t-test, R Square, Adj R square, F test).

IMPORT THE EXCEL DATA

library(readxl)
data <- read_excel("C:/Users/lenovo/OneDrive/Desktop/excel data/Sleep_Hours.xlsx")

model_simple <- lm(Exam_Score ~ Hours_Studied, data = data)
summary(model_simple)

## 
## Call:
## lm(formula = Exam_Score ~ Hours_Studied, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.503 -3.911 -1.800  4.386  7.327 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     53.333      3.475   15.35 3.23e-07 ***
## Hours_Studied    3.042      0.280   10.87 4.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.087 on 8 degrees of freedom
## Multiple R-squared:  0.9365, Adjusted R-squared:  0.9286 
## F-statistic:   118 on 1 and 8 DF,  p-value: 4.554e-06

model_multiple <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = data)
summary(model_multiple)

## 
## Call:
## lm(formula = Exam_Score ~ Hours_Studied + Sleep_Hours, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.480 -3.884 -1.582  4.218  7.774 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    48.8220    32.0541   1.523   0.1715  
## Hours_Studied   3.2095     1.2161   2.639   0.0335 *
## Sleep_Hours     0.5457     3.8515   0.142   0.8913  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.43 on 7 degrees of freedom
## Multiple R-squared:  0.9367, Adjusted R-squared:  0.9186 
## F-statistic:  51.8 on 2 and 7 DF,  p-value: 6.376e-05

Comparison of Simple and Multiple Linear Regression Models

Model 1: Simple Linear Regression

Model: Exam_Score ~ Hours_Studied

Intercept : 53.333
Slope : 3.042
R-squared : 0.9365
Adjusted R² : 0.9286
Residual Std. Error : 5.087 (8 DF)
F-statistic : 118
p-value : 4.554e-06

Model 2: Multiple Linear Regression

Model: Exam_Score ~ Hours_Studied + Sleep_Hours

Intercept : 48.822
Slope (Hours) : 3.209
Slope (Sleep) : 0.546
R-squared : 0.9367
Adjusted R² : 0.9186
Residual Std. Error : 5.43 (7 DF)
F-statistic : 51.8
p-value : 6.376e-05

INTERPRETATION AND COMPARISON:

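Intercept & Slopes: Both models show that Hours_Studied is a significant positive predictor of Exam_Score. In the multiple model, the slope is slightly higher (3.21 vs. 3.04), but its standard error rises from 0.280 to 1.216 and its t value falls from 10.87 to 2.64, so it is estimated much less precisely. The intercept drops from 53.33 to 48.82 and is no longer significant (p = 0.1715). The added predictor Sleep_Hours has a very high p-value (0.8913), so its coefficient (0.546) is not statistically distinguishable from zero.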

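R² and Adjusted R²: R-squared barely changes (0.9365 vs. 0.9367), since adding a predictor can never lower it. On adjusted R², the simple model (0.9286) slightly outperforms the multiple model (0.9186), indicating that Sleep_Hours adds no explanatory power once the extra parameter is penalized.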
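The adjusted R² values can be reproduced from the reported R-squared, the sample size, and the number of predictors, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helper `adjusted_r2()` below is a hypothetical function written for this illustration. The sample size n = 10 is inferred from the residual degrees of freedom (8 with one predictor).

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: one predictor
adj_simple <- adjusted_r2(0.9365, n = 10, p = 1)

# Multiple model: two predictors
adj_multiple <- adjusted_r2(0.9367, n = 10, p = 2)

print(round(c(simple = adj_simple, multiple = adj_multiple), 4))
```

A tiny gain in R-squared (0.0002) is outweighed by the penalty for the second predictor, which is why the adjusted value falls.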

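F-statistic: Both models are significant overall (p = 4.554e-06 and p = 6.376e-05), but the simple model’s F-value (118 on 1 and 8 DF) is higher than the multiple model’s (51.8 on 2 and 7 DF). The explained variation is almost unchanged and is now spread over two predictors, so the F-value falls.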
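The overall p-values follow from the F distribution with the degrees of freedom shown in each summary. The sketch below uses the rounded F-values from the printouts, so the results agree with the printed p-values only approximately; `p_simple` and `p_multiple` are illustrative names.

```r
# Upper-tail probability of the F distribution:
# the chance of an F-value at least this large if no predictor mattered
p_simple   <- pf(118,  df1 = 1, df2 = 8, lower.tail = FALSE)
p_multiple <- pf(51.8, df1 = 2, df2 = 7, lower.tail = FALSE)

print(c(simple = p_simple, multiple = p_multiple))
```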

Conclusion: Adding Sleep_Hours does not significantly improve the model. Simple linear regression with Hours_Studied alone is more effective and interpretable in this case.
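One likely reason the Hours_Studied slope becomes so imprecise is that the two predictors overlap heavily. In a two-predictor model, the slope's variance equals the error variance divided by Sxx(1 - r²), where r is the correlation between the predictors, so the variance inflation factor 1/(1 - r²) can be backed out of the two summaries. The sketch below does this from the rounded printed values, so it is approximate, and the sign of the correlation cannot be recovered this way. All object names are introduced here.

```r
# Standard errors of the Hours_Studied slope in each model
se_simple   <- 0.280
se_multiple <- 1.2161

# Residual standard errors of each model
sigma_simple   <- 5.087
sigma_multiple <- 5.43

# Ratio of slope variances, corrected for the change in error variance,
# leaves the variance inflation factor 1 / (1 - r^2)
vif_implied <- (se_multiple / se_simple)^2 / (sigma_multiple / sigma_simple)^2
print(vif_implied)

# Implied absolute correlation between Hours_Studied and Sleep_Hours
r_implied <- sqrt(1 - 1 / vif_implied)
print(r_implied)
```

If the data are available, computing `cor(data$Hours_Studied, data$Sleep_Hours)` directly would be the more reliable check.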

ECONOMETRICS PROJECT

Azra Khanum

2025-04-16

Q1