Write down all codes with the interpretation of the result. (Vector, Matrix, Summary of Vector and Matrix, make data frame (with 10 rows and 5 Columns), Summary of data frame, Interpretation of Summary)
my_vector <- c(5, 10, 15, 20, 25)
# Print the vector
print(my_vector)
## [1] 5 10 15 20 25
Interpretation: The numeric vector contains a sequence of evenly spaced values increasing by 5, representing a simple linear pattern.
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
# Print the matrix
print(my_matrix)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
Interpretation: This is a 2x3 matrix filled column-wise with integers from 1 to 6.
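To make the fill order concrete, the sketch below builds the same six values both column-wise (the default) and row-wise using the `byrow = TRUE` argument of `matrix()`. The name `my_matrix_by_row` is introduced here only for illustration.

```r
# Same six values, filled column-wise (default) and row-wise
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
my_matrix_by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)

print(my_matrix_by_row)

# Both matrices have the same shape, but the values sit in different cells
dim(my_matrix_by_row)    # 2 rows, 3 columns
my_matrix[1, ]           # first row when filled column-wise: 1 3 5
my_matrix_by_row[1, ]    # first row when filled row-wise:    1 2 3
```

The choice matters whenever the input vector is meant to be read row by row, as in a table typed out line by line.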
# Summary statistics of the vector
summary(my_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       5      10      15      15      20      25
Interpretation: The values are evenly spaced, so the distribution is symmetric and the mean and median are equal (both 15).
# Summary of the matrix (each column treated as a vector)
summary(my_matrix)
## V1 V2 V3
## Min. :1.00 Min. :3.00 Min. :5.00
## 1st Qu.:1.25 1st Qu.:3.25 1st Qu.:5.25
## Median :1.50 Median :3.50 Median :5.50
## Mean :1.50 Mean :3.50 Mean :5.50
## 3rd Qu.:1.75 3rd Qu.:3.75 3rd Qu.:5.75
## Max. :2.00 Max. :4.00 Max. :6.00
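Interpretation: summary() reports one set of statistics per column. Each column holds two consecutive integers, so within every column the mean equals the median (1.5, 3.5, and 5.5) and the quartiles are evenly spaced between the minimum and maximum.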
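As a small cross-check of the per-column statistics, the sketch below recomputes the column means and medians with `colMeans()` and `apply()`, and then summarises the matrix as a single vector. The names `col_means` and `col_medians` are illustrative.

```r
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

col_means <- colMeans(my_matrix)              # mean of each column
col_medians <- apply(my_matrix, 2, median)    # median of each column (MARGIN = 2)

print(col_means)
print(col_medians)

# With only two values per column, the mean and the median coincide
all.equal(col_means, col_medians)

# Statistics for the matrix as a whole, treating all six values as one vector
summary(as.vector(my_matrix))
```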
df <- data.frame(
  ID = 1:10,
  Age = c(21, 23, 22, 24, 25, 21, 22, 23, 24, 22),
  Score = c(88, 75, 90, 85, 92, 79, 80, 84, 87, 91),
  Passed = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  Gender = c("F", "M", "F", "M", "F", "M", "F", "F", "M", "F")
)
print(df)
## ID Age Score Passed Gender
## 1 1 21 88 TRUE F
## 2 2 23 75 FALSE M
## 3 3 22 90 TRUE F
## 4 4 24 85 TRUE M
## 5 5 25 92 TRUE F
## 6 6 21 79 FALSE M
## 7 7 22 80 TRUE F
## 8 8 23 84 TRUE F
## 9 9 24 87 TRUE M
## 10 10 22 91 TRUE F
Interpretation of the data frame: This dataset holds information on 10 individuals: ID, Age, Score, pass/fail status (Passed), and Gender. It combines numeric (ID, Age, Score), logical (Passed), and character (Gender) columns.
summary(df)
## ID Age Score Passed
## Min. : 1.00 Min. :21.00 Min. :75.0 Mode :logical
## 1st Qu.: 3.25 1st Qu.:22.00 1st Qu.:81.0 FALSE:2
## Median : 5.50 Median :22.50 Median :86.0 TRUE :8
## Mean : 5.50 Mean :22.70 Mean :85.1
## 3rd Qu.: 7.75 3rd Qu.:23.75 3rd Qu.:89.5
## Max. :10.00 Max. :25.00 Max. :92.0
## Gender
## Length:10
## Class :character
## Mode :character
##
##
##
Interpretation of the summary: Age ranges from 21 to 25 with a mean of 22.7, so all individuals are in their early to mid twenties. Score ranges from 75 to 92 with a mean of 85.1, showing generally good performance. Passed: 8 of the 10 individuals passed, a pass rate of 80%. Gender is a character column, so summary() reports only its length and class; the printed data show 6 females and 4 males.
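Because summary() reports only length and class for a character column, the gender counts have to be obtained another way. A minimal sketch using `table()`, `mean()` on the logical column, and a factor conversion follows; the names `gender_counts` and `pass_rate` are illustrative.

```r
df <- data.frame(
  ID = 1:10,
  Age = c(21, 23, 22, 24, 25, 21, 22, 23, 24, 22),
  Score = c(88, 75, 90, 85, 92, 79, 80, 84, 87, 91),
  Passed = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  Gender = c("F", "M", "F", "M", "F", "M", "F", "F", "M", "F")
)

# Count how often each gender occurs
gender_counts <- table(df$Gender)
print(gender_counts)

# TRUE counts as 1 and FALSE as 0, so the mean is the pass rate
pass_rate <- mean(df$Passed)
print(pass_rate)

# After converting to a factor, summary() reports counts per level
df$Gender <- factor(df$Gender)
summary(df$Gender)
```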
Explain the process of importing data from Windows (CSV file, Excel file) and packages. Interpret imported data with a summary explained according to graphs. Plot histogram, box plot, 3d plot, and frequency density (For Plotting, use data either a CSV file or an Excel file).
#install.packages("readr")
#install.packages("readxl")
library(readr) # For read_csv
library(readxl) # For read_excel
# Replace with your actual file path
data <- read_excel("C:/Users/lenovo/OneDrive/Desktop/excel data/project_data.xlsx")
# View the first few rows
head(data)
## # A tibble: 5 × 5
## Name Favorite_Movie Genre Year Rating
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Azra Little Women Drama 2019 9.2
## 2 Ayaan Interstellar Sci-Fi 2014 9.5
## 3 Fatima Queen Comedy 2013 8.8
## 4 Zoya The Notebook Romance 2004 8.5
## 5 Rehan Inception Thriller 2010 9.4
# View structure of the dataset
str(data)
## tibble [5 × 5] (S3: tbl_df/tbl/data.frame)
## $ Name : chr [1:5] "Azra" "Ayaan" "Fatima" "Zoya" ...
## $ Favorite_Movie: chr [1:5] "Little Women" "Interstellar" "Queen" "The Notebook" ...
## $ Genre : chr [1:5] "Drama" "Sci-Fi" "Comedy" "Romance" ...
## $ Year : num [1:5] 2019 2014 2013 2004 2010
## $ Rating : num [1:5] 9.2 9.5 8.8 8.5 9.4
# Summary of all variables (character and numeric)
summary(data)
## Name Favorite_Movie Genre Year
## Length:5 Length:5 Length:5 Min. :2004
## Class :character Class :character Class :character 1st Qu.:2010
## Mode :character Mode :character Mode :character Median :2013
## Mean :2012
## 3rd Qu.:2014
## Max. :2019
## Rating
## Min. :8.50
## 1st Qu.:8.80
## Median :9.20
## Mean :9.08
## 3rd Qu.:9.40
## Max. :9.50
Interpretation of the summary: Name, Favorite_Movie, and Genre are character variables with 5 entries each, representing individual preferences. Year: release years range from 2004 to 2019, with a mean of 2012 and the middle half of the movies released between 2010 and 2014. Rating: ratings range from 8.5 to 9.5, so all selected movies are highly rated. The median rating is 9.2 and the mean is 9.08, indicating consistently high preferences. The 3rd quartile is 9.4, so the top 25% of ratings are at or above 9.4.
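The task also mentions CSV files. As a hedged sketch, a CSV can be imported with base R's `read.csv()` (or with `readr::read_csv()`, which returns a tibble). The example writes its own temporary file so that it runs anywhere; the rows copy the `head(data)` output shown earlier, and `movies`, `csv_path`, and `csv_data` are hypothetical names, not files or objects from the original project.

```r
# Rebuild the five rows shown in the head(data) output so the example is self-contained
movies <- data.frame(
  Name = c("Azra", "Ayaan", "Fatima", "Zoya", "Rehan"),
  Favorite_Movie = c("Little Women", "Interstellar", "Queen", "The Notebook", "Inception"),
  Genre = c("Drama", "Sci-Fi", "Comedy", "Romance", "Thriller"),
  Year = c(2019, 2014, 2013, 2004, 2010),
  Rating = c(9.2, 9.5, 8.8, 8.5, 9.4)
)

# Write a temporary CSV file (a stand-in for a real file on disk)
csv_path <- tempfile(fileext = ".csv")
write.csv(movies, csv_path, row.names = FALSE)

# Import the CSV; in Windows paths, use forward slashes or doubled backslashes
csv_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure and summarise the numeric Rating column
str(csv_data)
summary(csv_data$Rating)
```

For a real file, `csv_path` would be replaced by the path to the CSV on disk.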
#install.packages("ggplot2")
#install.packages("plotly")
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplot(data, aes(x = Rating)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Ratings", x = "Rating", y = "Frequency")
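Interpretation: Most ratings fall between 8.8 and 9.5, showing a concentration of high preferences. The distribution is slightly left-skewed: the mean (9.08) lies below the median (9.2), and only two of the five ratings fall below 9. With just five observations, the shape of the histogram is only indicative.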
ggplot(data, aes(y = Rating)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Boxplot of Ratings", y = "Rating")
Interpretation: The median rating is 9.2, so half of the ratings are at or above this value. There are no outliers, and the ratings are tightly clustered between 8.5 and 9.5, indicating consistently high ratings.
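The absence of outliers can be checked numerically. A minimal sketch with `boxplot.stats()` from base R follows; it applies the usual rule of flagging points more than 1.5 times the box height beyond the hinges. The `ratings` vector is copied from the Rating column in the output above, and `box_info` is an illustrative name.

```r
# Rating column, copied from the imported data shown earlier
ratings <- c(9.2, 9.5, 8.8, 8.5, 9.4)

box_info <- boxplot.stats(ratings)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(box_info$stats)

# Values flagged as outliers; an empty vector means none were found
print(box_info$out)
```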
plot_ly(data = data,
        x = ~Year,    # Movie release year
        y = ~Rating,  # Movie rating
        z = ~Name,    # Person's name
        type = "scatter3d",
        mode = "markers",
        marker = list(size = 5, color = ~Rating, colorscale = "Viridis")) %>%
  layout(title = "3D Plot: Year vs Rating vs Name")
Interpretation: This 3D plot shows how each individual (Name axis) rated a favorite movie against its release year. We observe: Four of the five favorite movies were released in 2010 or later. All ratings are high, at 8.5 or above. Each person chose a different movie from a different genre and rated it highly.
# Load ggplot2 package
library(ggplot2)
# Create the density plot
ggplot(data, aes(x = Rating)) +
  geom_density(fill = "lightblue", alpha = 0.6, color = "darkblue") +
  labs(title = "Frequency Density Plot of Movie Ratings",
       x = "Rating",
       y = "Density") +
  theme_minimal()
Interpretation: The density plot shows that most movie ratings are concentrated between 8.8 and 9.5. The distribution is slightly left-skewed, with a thinner tail toward ratings below 9. Everyone in the dataset chose movies they rated highly; there are no low-rated entries.
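The direction of the skew can be checked without a plot. The sketch below compares the mean with the median and computes a simple moment-based skewness coefficient; a mean below the median and a negative coefficient both point to a left skew. The formula is the plain (uncorrected) moment version written out by hand rather than a package function, and the `ratings` vector is copied from the Rating column shown above.

```r
# Rating column, copied from the imported data shown earlier
ratings <- c(9.2, 9.5, 8.8, 8.5, 9.4)

rating_mean <- mean(ratings)
rating_median <- median(ratings)

# Moment-based skewness: third central moment divided by variance^(3/2)
deviations <- ratings - rating_mean
skewness <- mean(deviations^3) / mean(deviations^2)^1.5

print(c(mean = rating_mean, median = rating_median, skewness = skewness))

# TRUE when both indicators agree on a left (negative) skew
rating_mean < rating_median && skewness < 0
```

With only five observations the coefficient is a rough indication, not a reliable estimate of the shape.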
Take other imported data and do correlation with interpretation, simple linear regression with interpretation (Check and interpret: Intercept and Slope, variance of parameter and error term, t-test, p-value, Adj R square, F test). Plot regression model on the graph.
library(readxl)
# Import the data from the Excel file
data <- read_excel("C:/Users/lenovo/OneDrive/Desktop/excel data/study_hour.xlsx")
correlation <- cor(data$Hours_Studied, data$Exam_Score)
print(correlation)
## [1] 0.967746
Interpretation: The correlation coefficient of 0.967746 indicates a very strong positive linear relationship between hours studied and exam scores: as study hours increase, exam scores tend to increase as well.
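`cor()` returns only the coefficient. To test whether the correlation differs from zero, `cor.test()` adds a t-test, a p-value, and a confidence interval. The sketch below uses hypothetical data, because the contents of study_hour.xlsx are not reproduced in this report; the data frame `study` and its values are assumptions made for illustration, so the numbers will not match the 0.967746 reported above.

```r
# Hypothetical data for illustration only (not the contents of study_hour.xlsx)
study <- data.frame(
  Hours_Studied = c(2, 4, 5, 7, 8, 10, 12, 14, 16, 18),
  Exam_Score = c(58, 66, 65, 78, 75, 84, 86, 93, 92, 99)
)

# Pearson correlation coefficient
r <- cor(study$Hours_Studied, study$Exam_Score)

# Significance test for the correlation (Pearson by default)
r_test <- cor.test(study$Hours_Studied, study$Exam_Score)

print(r)
print(r_test$p.value)    # p-value for H0: true correlation is 0
print(r_test$conf.int)   # 95% confidence interval for the correlation
```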
model <- lm(Exam_Score ~ Hours_Studied, data = data)
# Print the model summary
summary(model)
##
## Call:
## lm(formula = Exam_Score ~ Hours_Studied, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.503 -3.911 -1.800 4.386 7.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.333 3.475 15.35 3.23e-07 ***
## Hours_Studied 3.042 0.280 10.87 4.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.087 on 8 degrees of freedom
## Multiple R-squared: 0.9365, Adjusted R-squared: 0.9286
## F-statistic: 118 on 1 and 8 DF, p-value: 4.554e-06
Interpretation of the regression model summary:
Intercept (53.333): When Hours_Studied is 0, the model predicts an exam score of 53.33. This is the starting point on the exam score scale.
Slope (3.042): For each additional hour studied, the predicted exam score increases by 3.042 points on average. This shows a strong positive relationship between study hours and exam performance.
Significance of Parameters (t-test): The t-values are 15.35 for the Intercept and 10.87 for Hours_Studied. The corresponding p-values (3.23e-07 and 4.55e-06) are far below 0.05, so both parameters are statistically significant.
Residual Analysis: The residuals (differences between observed and predicted values) range from -5.503 to 7.327. The median residual (-1.800) is below zero, so the model slightly over-predicts for about half of the observations. The residual standard error of 5.087 on 8 degrees of freedom estimates the standard deviation of the error term.
Model Fit: Multiple R-squared (0.9365): About 93.65% of the variation in exam scores is explained by the number of hours studied.
Adjusted R-squared (0.9286): After adjusting for the number of predictors (only one in this case), 92.86% of the variation is still explained by study hours, indicating a good model fit.
F-statistic (118): The high F-statistic and a very small p-value (4.554e-06) show that the model is statistically significant, and the independent variable (Hours_Studied) is a good predictor of exam scores.
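The task also asks for the variance of the parameters and of the error term. These are the squares of the reported standard errors: roughly 3.475^2 = 12.08 for the intercept, 0.280^2 = 0.078 for the slope, and 5.087^2 = 25.88 for the error term. The sketch below shows how `vcov()` and `sigma()` return these quantities from a fitted model, and confirms that in simple regression the F-statistic is the square of the slope's t-value (10.87^2 is about 118 in the output above). It uses hypothetical data, since the Excel file is not reproduced here; `study` and `fit` are illustrative names.

```r
# Hypothetical data for illustration only (not the contents of study_hour.xlsx)
study <- data.frame(
  Hours_Studied = c(2, 4, 5, 7, 8, 10, 12, 14, 16, 18),
  Exam_Score = c(58, 66, 65, 78, 75, 84, 86, 93, 92, 99)
)

fit <- lm(Exam_Score ~ Hours_Studied, data = study)
coefs <- summary(fit)$coefficients

# Variances of the intercept and slope: diagonal of the covariance matrix
param_var <- diag(vcov(fit))

# Estimated variance of the error term: squared residual standard error
error_var <- sigma(fit)^2

print(param_var)
print(error_var)

# Standard errors in the summary are the square roots of these variances
all.equal(unname(sqrt(param_var)), unname(coefs[, "Std. Error"]))

# With one predictor, the F-statistic equals the squared t-value of the slope
f_stat <- summary(fit)$fstatistic[["value"]]
t_slope <- coefs["Hours_Studied", "t value"]
all.equal(f_stat, t_slope^2)
```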
ggplot(data, aes(x = Hours_Studied, y = Exam_Score)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "Regression Line: Hours Studied vs Exam Score",
       x = "Hours Studied",
       y = "Exam Score")
## `geom_smooth()` using formula = 'y ~ x'
Take the multiple regression model and compare and interpret with a simple Linear regression model (basis of comparison: Intercept and Slope, variance of parameter and error term, t-test, R Square, Adj R square, F test).
library(readxl)
data <- read_excel("C:/Users/lenovo/OneDrive/Desktop/excel data/Sleep_Hours.xlsx")
model_simple <- lm(Exam_Score ~ Hours_Studied, data = data)
summary(model_simple)
##
## Call:
## lm(formula = Exam_Score ~ Hours_Studied, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.503 -3.911 -1.800 4.386 7.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.333 3.475 15.35 3.23e-07 ***
## Hours_Studied 3.042 0.280 10.87 4.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.087 on 8 degrees of freedom
## Multiple R-squared: 0.9365, Adjusted R-squared: 0.9286
## F-statistic: 118 on 1 and 8 DF, p-value: 4.554e-06
model_multiple <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = data)
summary(model_multiple)
##
## Call:
## lm(formula = Exam_Score ~ Hours_Studied + Sleep_Hours, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.480 -3.884 -1.582 4.218 7.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.8220 32.0541 1.523 0.1715
## Hours_Studied 3.2095 1.2161 2.639 0.0335 *
## Sleep_Hours 0.5457 3.8515 0.142 0.8913
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.43 on 7 degrees of freedom
## Multiple R-squared: 0.9367, Adjusted R-squared: 0.9186
## F-statistic: 51.8 on 2 and 7 DF, p-value: 6.376e-05
Comparison of Simple and Multiple Linear Regression Models
Model: Exam_Score ~ Hours_Studied
Intercept : 53.333
Slope : 3.042
R-squared : 0.9365
Adjusted R² : 0.9286
Residual Std. Error : 5.087 (8 degrees of freedom)
F-statistic : 118
p-value : 4.554e-06
Model: Exam_Score ~ Hours_Studied + Sleep_Hours
Intercept : 48.822
Slope (Hours) : 3.209
Slope (Sleep) : 0.546
R-squared : 0.9367
Adjusted R² : 0.9186
Residual Std. Error : 5.43 (7 degrees of freedom)
F-statistic : 51.8
p-value : 6.376e-05
INTERPRETATION AND COMPARISON:
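Intercept & Slopes: Both models show that Hours_Studied is a significant positive predictor of Exam_Score. In the multiple model its slope is slightly higher (3.21 vs. 3.04), but its standard error rises from 0.280 to 1.216 and its t-value falls from 10.87 to 2.64 (p = 0.0335). The intercept is no longer significant (p = 0.1715), and the added predictor Sleep_Hours has a very high p-value (0.8913), so it is not statistically significant. The residual standard error also increases from 5.087 to 5.43.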
R² and Adjusted R²: R² barely changes (0.9365 vs. 0.9367), while adjusted R² falls from 0.9286 in the simple model to 0.9186 in the multiple model, indicating that adding Sleep_Hours did not improve prediction once the extra parameter is penalized.
F-statistic: Both models are significant overall, but the simple model's F-value (118) is higher than the multiple model's (51.8), showing that the simple model has the stronger overall fit.
Conclusion: Adding Sleep_Hours does not significantly improve the model. Simple linear regression with Hours_Studied alone is more effective and interpretable in this case.
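Two quick checks support this comparison. First, the adjusted R² values can be reproduced from the reported R² values with n = 10 observations. Second, nested models can be compared formally with `anova()`, which tests whether the extra predictor reduces the residual sum of squares significantly. The sketch below uses hypothetical data for the `anova()` part, since Sleep_Hours.xlsx is not reproduced here; `adj_r2`, `study`, `fit_simple`, and `fit_multiple` are illustrative names.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_simple <- adj_r2(0.9365, n = 10, p = 1)     # reported R-squared, simple model
adj_multiple <- adj_r2(0.9367, n = 10, p = 2)   # reported R-squared, multiple model
print(round(c(simple = adj_simple, multiple = adj_multiple), 4))

# Hypothetical data for illustration only (not the contents of Sleep_Hours.xlsx)
study <- data.frame(
  Hours_Studied = c(2, 4, 5, 7, 8, 10, 12, 14, 16, 18),
  Sleep_Hours = c(8, 7.5, 8, 7, 7.5, 6.5, 7, 6, 6.5, 6),
  Exam_Score = c(58, 66, 65, 78, 75, 84, 86, 93, 92, 99)
)

fit_simple <- lm(Exam_Score ~ Hours_Studied, data = study)
fit_multiple <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = study)

# Partial F-test: does adding Sleep_Hours significantly reduce the residual error?
comparison <- anova(fit_simple, fit_multiple)
print(comparison)

# With one added predictor, the partial F equals the squared t-value of that predictor
t_sleep <- summary(fit_multiple)$coefficients["Sleep_Hours", "t value"]
all.equal(comparison$F[2], t_sleep^2)
```

A large p-value in the `anova()` table would lead to the same conclusion as the t-test on Sleep_Hours: the simpler model is adequate.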