R Markdown is a powerful tool for combining text, code, and equations in a single document. The Orange dataset is a classic dataset with 35 observations of 3 variables. It records the age (in days) and trunk circumference (in mm) of five orange trees, each measured at seven ages; the Tree variable identifies the individual tree. We manipulate and visualize the data and build a regression model to predict the age of an orange tree from its circumference.
Here, we analyze different aspects of the orange trees through exploratory analysis, visualization, and regression.
First, we import the libraries.
library(ggplot2)
library(hrbrthemes)
library(tidyverse)
library(viridis)
Load the data and display the first rows
data <- data.frame(Orange)
head(data)
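As a quick sanity check on the dataset description, the sketch below confirms the dimensions and the number of measurements per tree. It uses base R only; the names `orange_df`, `dims`, `tree_levels`, and `obs_per_tree` are illustrative and not part of the original analysis.

```r
# Orange ships with base R (datasets package), so no extra install is needed.
orange_df <- data.frame(Orange)        # illustrative copy of the data

dims <- dim(orange_df)                 # number of rows and columns
tree_levels <- levels(orange_df$Tree)  # labels of the individual trees
obs_per_tree <- table(orange_df$Tree)  # number of measurements per tree

print(dims)
print(tree_levels)
print(obs_per_tree)
```

Each of the five trees should contribute the same number of rows, which is why the later per-tree summaries are directly comparable.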
We perform some initial EDA using the DataExplorer package.
library(DataExplorer)
create_report(Orange)
Check for missing values
anyNA(data)
## [1] FALSE
No missing values are present.
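Note that `is.null()` only tests whether the object itself is `NULL`, not whether it contains missing entries. A minimal sketch of a per-column check, assuming the plain data frame copy `orange_df` (an illustrative name), could look like this:

```r
orange_df <- data.frame(Orange)  # illustrative copy of the data

# Count missing values (NA) in each column
missing_per_column <- colSums(is.na(orange_df))
print(missing_per_column)

# Count rows with no missing value in any column
complete_rows <- sum(complete.cases(orange_df))
print(complete_rows)
```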
summary(data)
##  Tree       age         circumference
##  3:7   Min.   : 118.0   Min.   : 30.0
##  1:7   1st Qu.: 484.0   1st Qu.: 65.5
##  5:7   Median :1004.0   Median :115.0
##  2:7   Mean   : 922.1   Mean   :115.9
##  4:7   3rd Qu.:1372.0   3rd Qu.:161.5
##        Max.   :1582.0   Max.   :214.0
age column: The smallest age in the dataset is 118.0 days. 25% of the observations have an age of at most 484.0 days (the first quartile), and 50% have an age of at most 1004.0 days (the median, or second quartile). The mean age is approximately 922.1 days. 75% of the observations have an age of at most 1372.0 days (the third quartile). The largest recorded age is 1582.0 days.
circumference column: The smallest circumference in the dataset is 30.0 mm. 25% of the observations have a circumference of at most 65.5 mm, and 50% have a circumference of at most 115.0 mm (the median, or second quartile). The mean circumference is approximately 115.9 mm. 75% of the observations have a circumference of at most 161.5 mm. The largest circumference in the dataset is 214.0 mm.
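The quartiles quoted above can also be computed directly with `quantile()`, which uses the same default method as `summary()`. This is a small sketch; the variable names `age_quartiles` and `circ_quartiles` are illustrative.

```r
# First quartile, median, and third quartile of each numeric column
probs <- c(0.25, 0.5, 0.75)
age_quartiles  <- quantile(Orange$age, probs = probs)
circ_quartiles <- quantile(Orange$circumference, probs = probs)

print(age_quartiles)
print(circ_quartiles)
```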
cor(data$circumference, data$age)
## [1] 0.9135189
The correlation coefficient between circumference and age is 0.9135189, which indicates a strong positive linear relationship: as the circumference of an orange tree increases, its age tends to increase as well.
Scatter plots
plot(data$circumference, data$age, xlab = "circumference", ylab = "age")
The points in the scatter plot show a clear upward trend, which confirms the positive relationship between age and circumference.
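To go one step beyond the point estimate, `cor.test()` reports a confidence interval and a significance test for the Pearson correlation. The sketch below is an optional extra, not part of the original analysis; `ct` and `r` are illustrative names. For a regression with a single predictor, the squared correlation equals the multiple R-squared.

```r
# Pearson correlation with a 95% confidence interval and p-value
ct <- cor.test(Orange$circumference, Orange$age)
r <- unname(ct$estimate)

print(r)            # correlation coefficient
print(ct$conf.int)  # 95% confidence interval for the correlation
print(ct$p.value)   # p-value for H0: correlation = 0
print(r^2)          # squared correlation
```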
ggplot(data, aes(x = circumference, y = age, color = Tree)) +
  geom_point(size = 2) +
  theme_ipsum()
The scatter plot above shows the relationship between age and circumference for each of the five trees and highlights their different growth patterns. Tree 4 grows fastest and reaches the largest circumference of all (214 mm), while Tree 3 grows most slowly and ends with the smallest circumference (140 mm).
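One way to put a number on these growth patterns is to fit a separate straight line per tree and compare the slopes (mm of circumference gained per day). This is a rough sketch under the assumption that growth is approximately linear over the observed ages; `orange_df` and `growth_rate` are illustrative names.

```r
orange_df <- data.frame(Orange)  # illustrative copy of the data

# Slope of circumference on age, fitted separately for each tree
growth_rate <- sapply(
  split(orange_df, orange_df$Tree),
  function(d) coef(lm(circumference ~ age, data = d))[["age"]]
)

print(round(growth_rate, 4))  # mm per day, one value per tree
```

A larger slope means faster growth in circumference, so the tree with the largest value is the fastest grower under this linear approximation.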
Bar plot of the tree circumference
ggplot(data, aes(x = Tree, y = circumference, fill = Tree, color = Tree)) +
  geom_bar(stat = "identity", position = "dodge")
The bar plot shows that Tree 4 reaches the largest circumference of all the trees: its bar is the tallest. Tree 3 has the shortest bar, indicating the smallest maximum circumference.
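Since each tree contributes seven measurements to the same bar position, the visible bar height should correspond to that tree's largest measurement. A quick numeric check of the maxima, as a sketch with illustrative names `orange_df` and `max_circ`:

```r
orange_df <- data.frame(Orange)  # illustrative copy of the data

# Largest circumference (mm) recorded for each tree
max_circ <- tapply(orange_df$circumference, orange_df$Tree, max)
print(max_circ)
```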
Box plots of circumference for each tree
data %>%
  ggplot(aes(x = Tree, y = circumference, fill = Tree)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  theme_ipsum() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("Basic boxplot") +
  xlab("")
data %>%
  ggplot(aes(x = Tree, y = circumference, fill = Tree)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6) +
  geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
  theme_ipsum() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("A boxplot with jitter") +
  xlab("")
Based on the box plots above, Tree 4 has the highest median circumference of all the trees. Its median lies closer to the upper end of the box, and the box itself is the tallest, indicating the widest spread of circumference values. Trees 2 and 5 have moderately large boxes, indicating a substantial spread, but their medians are lower than that of Tree 4. Trees 1 and 3 have smaller boxes, indicating a narrower range of circumference values, with medians that again fall close to the upper end of the box. Tree 3 has the smallest median circumference of all the trees.
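The medians and box heights read off the plots can be cross-checked numerically. The sketch below computes the median and interquartile range (the height of each box) per tree; `orange_df`, `med`, `iqr`, and `box_stats` are illustrative names.

```r
orange_df <- data.frame(Orange)  # illustrative copy of the data

# Median and interquartile range of circumference (mm) for each tree
med <- tapply(orange_df$circumference, orange_df$Tree, median)
iqr <- tapply(orange_df$circumference, orange_df$Tree, IQR)

box_stats <- cbind(median = med, IQR = iqr)
print(box_stats)
```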
model <- lm(age ~ circumference , data = Orange)
summary(model)
##
## Call:
## lm(formula = age ~ circumference, data = Orange)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -317.88 -140.90  -17.20   96.54  471.16
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    16.6036    78.1406   0.212    0.833
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
plot(model)
The output above is the result of a linear regression of age on circumference in the Orange dataset. The coefficients define the equation of the fitted regression line: age = 16.6036 + 7.8160 * circumference
Intercept (16.6036): the predicted age, in days, when the circumference of the tree is zero. Since a circumference of zero lies well below the observed range (30 to 214 mm), this value is an extrapolation and is not practically meaningful in this context. The intercept is also not significantly different from zero (p-value: 0.833).
Coefficient for circumference (7.8160): the change in the predicted age for every one-unit increase in circumference. Here, each additional millimetre of circumference increases the predicted age by 7.8160 days. The “Pr(>|t|)” column provides the p-value used to test the significance of each coefficient; the circumference coefficient is highly significant (p-value: 1.93e-14).
Multiple R-squared and Adjusted R-squared: The multiple R-squared measures how well the model fits the data. It represents the proportion of variance in age that is explained by circumference. Here, the multiple R-squared is 0.8345, so approximately 83.45% of the variance in age is explained by circumference. The adjusted R-squared corrects the multiple R-squared for the number of predictors in the model and penalizes the inclusion of irrelevant predictors. Here, the adjusted R-squared is 0.8295, slightly lower than the multiple R-squared.
F-statistic and p-value: The F-statistic tests the overall significance of the model, that is, whether there is a significant linear relationship between the predictor circumference and the response age. Here, the F-statistic is 166.4 and the p-value is 1.931e-14 (extremely small). This indicates that the model is highly significant and that there is a strong linear relationship between age and circumference.
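To make the R-squared interpretation concrete, the sketch below recomputes it from the residual and total sums of squares and adds 95% confidence intervals for the coefficients with `confint()`. The confidence intervals are an optional extra, not part of the original analysis; `rss`, `tss`, `r_squared`, and `ci` are illustrative names.

```r
model <- lm(age ~ circumference, data = Orange)

# R-squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(model)^2)
tss <- sum((Orange$age - mean(Orange$age))^2)
r_squared <- 1 - rss / tss
print(r_squared)

# 95% confidence intervals for the intercept and the slope
ci <- confint(model, level = 0.95)
print(ci)
```

An interval for the intercept that contains zero is consistent with its large p-value, while an interval for the slope that stays above zero matches the highly significant circumference coefficient.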
# New observation to predict for (circumference in mm)
circumferenceDF <- data.frame(circumference = 1243)
predicted_age <- predict(model, circumferenceDF)
# Display the predicted age (in days)
cat("The age is:", predicted_age)
## The age is: 9731.89
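The code above uses the fitted linear regression model to predict the age of a tree from a given circumference. Note that 1243 mm lies far outside the observed circumference range (30 to 214 mm), so this prediction is an extrapolation and should be interpreted with caution.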
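For comparison, here is a sketch of predictions for circumferences inside the observed range, together with 95% prediction intervals that convey the uncertainty of each estimate. The circumference values 50, 100, and 150 mm are chosen purely for illustration, and `new_trees` and `pred` are illustrative names.

```r
model <- lm(age ~ circumference, data = Orange)

# Hypothetical circumferences (mm) within the observed range
new_trees <- data.frame(circumference = c(50, 100, 150))

# Point predictions (fit) with lower and upper 95% prediction bounds, in days
pred <- predict(model, new_trees, interval = "prediction", level = 0.95)
print(cbind(new_trees, pred))
```

The `fit` column follows the fitted line age = 16.6036 + 7.8160 * circumference, while `lwr` and `upr` show how wide the plausible range for a single tree is.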
In conclusion, our project examined five orange trees and found that Tree 4 reached the largest circumference. We also observed a strong positive relationship between circumference and age. The predictive model we developed holds promise for age estimation. However, we emphasize the importance of increasing the sample size to further improve the model’s accuracy.