Wild Blueberry

INTRODUCTION

This report uses a research dataset on wild blueberries that is used to predict yield and to study the behaviour of bees. I found the data on Kaggle. The dataset contains 251 observations and 7 variables, a mix of categorical and quantitative. clonesize, rain, and fruitsize are the categorical variables; average temperature, seeds, and yield are the quantitative variables.

CLEANING DIRTY DATA

It took me a lot of time to settle on my final dataset. I originally wanted to work with avocado sales data and had started on it, but I found a lot of dirty data. I used some techniques to sort and filter the data in order to clean part of it.
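The sorting and filtering steps themselves are not shown in this report. As a minimal sketch of what such cleaning could look like in base R, the example below uses a small made-up data frame; the values are illustrative assumptions and not the real data.

```r
# Hypothetical toy data standing in for a messy raw file
raw <- data.frame(
  yield    = c(5200, NA, 6100, -10, 4800),
  averageT = c(71.2, 68.5, NA, 70.1, 69.9)
)

# Keep only the rows that have no missing values
complete <- raw[complete.cases(raw), ]

# Filter out impossible values (a negative yield cannot be real)
clean <- subset(complete, yield > 0)

# Sort the remaining rows by yield, smallest first
clean <- clean[order(clean$yield), ]
clean
```

Doing the steps one at a time makes it easy to check how many rows each step removes.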

Yield is the main variable we want to predict, so we look at its relationship with the other variables.

FOR YIELD (MEAN, SD, SUMMARY)

df <- read.csv("wildblueberry.csv")
mean(df$yield)

## [1] 5817.303

sd(df$yield)

## [1] 1490.436

summary(df$yield)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1946    4703    6031    5817    7020    8622

HISTOGRAM OF YIELD

hist(df$yield,
     main = "Histogram of Yield",
     xlab = "YIELD (KG)",
     col = "purple",
     probability = TRUE)
# Overlay a kernel density curve on the density-scaled histogram
lines(density(df$yield), col = "blue")

I used the hist function to draw a histogram of yield.

BOXPLOT

boxplot(df$yield,
        main = "BOX PLOT",
        horizontal = FALSE,
        ylab = "YIELD (KG)")

I used the boxplot function to show the spread of yield.

QQ PLOT

qqnorm(df$yield,
       main = "QQplot Yield",
       ylab = "YIELD (KG)")
qqline(df$yield, col = "red")

The qqnorm function plots the sample quantiles of yield against the theoretical quantiles of a normal distribution, and the qqline function adds a reference line, drawn here in red.

OUTLIERS

In a dataset, some values can lie relatively far from the rest; those values are called outliers. I used the IQR function and the boxplot function to determine the outliers.

IQR(df$yield, na.rm = FALSE)

## [1] 2316.45

boxplot(df$yield, plot = FALSE)$out

## numeric(0)

outliers <- boxplot(df$yield, plot = FALSE)$out
outliers

## numeric(0)

The result numeric(0) is an empty vector, which means that no yield value lies beyond the boxplot whiskers. In other words, yield has no outliers by this rule.
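The rule behind the boxplot whiskers can also be written out by hand: a value is flagged when it lies more than 1.5 times the IQR below the first quartile or above the third quartile. The sketch below uses made-up numbers (not the blueberry data) so that one value is clearly flagged. Note that boxplot computes its quartiles as hinges, so on some data it can differ slightly from quantile.

```r
# Hypothetical toy values; 100 is deliberately far from the rest
x <- c(10, 12, 13, 14, 15, 16, 18, 100)

q   <- quantile(x, c(0.25, 0.75))  # first and third quartiles
iqr <- IQR(x)                      # third quartile minus first quartile

# Fences: anything outside them is flagged as an outlier
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

flagged <- x[x < lower | x > upper]
flagged
```

Here only the value 100 falls outside the fences.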

GRAPHICAL DISPLAY OF MULTIPLE VARIABLES AND THEIR CORRELATION

plot(df$yield, df$averageT,
     main = "MULTIPLE VARIABLES GRAPHICAL DISPLAY",
     xlab = "YIELD (KG)",
     ylab = "AVERAGE TEMP (°C)",
     pch = 20)

I used the plot function to draw a scatter plot of yield against average temperature, which shows how the two variables move together.
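A scatter plot shows the relationship visually, while the cor function puts a number on it. The sketch below uses a made-up data frame with the same two column names; the values are assumptions for illustration only, so the result says nothing about the real data.

```r
# Hypothetical toy data, not the real wild blueberry measurements
toy <- data.frame(
  yield    = c(4000, 4500, 5200, 6100, 7000),
  averageT = c(72, 70, 69, 66, 64)
)

# Pearson correlation between two quantitative variables (between -1 and 1)
r <- cor(toy$yield, toy$averageT)
r

# Correlation matrix for all numeric columns at once
cor(toy)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.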

FREQUENCY TABLE & RELATIVE FREQUENCY TABLE

FREQUENCY TABLE

table(df$clonesize)

## 
##  20  35 
## 226  25

RELATIVE FREQUENCY TABLE

table(df$clonesize) / length(df$clonesize)

## 
##         20         35 
## 0.90039841 0.09960159

TWO WAY TABLE (fruit size relation with clone size)

two_way_table <-table(df$fruitsize,df$clonesize)

This table helps us see how fruit size is distributed across the two clone sizes. It suggests that the smaller clone size (20) goes with more large fruits than the bigger clone size (35).
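Because the two clone size groups are very unequal (226 observations against 25), raw counts alone can be misleading: the bigger group will have more of almost everything. Comparing proportions within each group is fairer. A minimal sketch with made-up data (the category labels and values are assumptions, not the real data):

```r
# Hypothetical toy data with the same two column names
toy <- data.frame(
  fruitsize = c("small", "large", "large", "small", "large", "small"),
  clonesize = c(20, 20, 20, 35, 35, 20)
)

# Two-way table of counts: rows are fruit sizes, columns are clone sizes
tab <- table(toy$fruitsize, toy$clonesize)
tab

# Column proportions: share of each fruit size within each clone size
col_props <- prop.table(tab, margin = 2)
col_props
```

With margin = 2 every column sums to 1, so the columns can be compared directly even when the groups differ in size.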

SIDE BY SIDE PLOT (one categorical and one quantitative)

par(mfrow = c(1, 2))
plot(df$yield,
     main = "Scatter plot of yield (KG)",
     xlab = "",
     ylab = "Yield")
plot(df$clonesize,
     main = "Scatter plot of clone size",
     xlab = "",
     ylab = "Clone size")

print("yield summary:")

## [1] "yield summary:"

summary(df$yield)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1946    4703    6031    5817    7020    8622

print("clonesize summary:")

## [1] "clonesize summary:"

summary(df$clonesize)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   20.00   20.00   21.49   20.00   35.00
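Another common way to compare one quantitative variable across the levels of a categorical one is a side-by-side boxplot, using the formula interface of boxplot. The sketch below uses made-up values with the same column names; it is an illustration under those assumptions, not a result from the real data.

```r
# Hypothetical toy data: yield for two clone size groups
toy <- data.frame(
  yield     = c(5200, 6100, 4800, 7000, 6500, 3900, 4200, 4600),
  clonesize = c(20, 20, 20, 20, 20, 35, 35, 35)
)

# One box per clone size, drawn side by side on a shared yield axis
bp <- boxplot(yield ~ factor(clonesize), data = toy,
              main = "Yield by clone size",
              xlab = "Clone size",
              ylab = "Yield (KG)")

# The returned list holds the group names and the five-number statistics
bp$names
```

Wrapping clonesize in factor makes R treat the two numeric codes as categories rather than as a numeric scale.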

VISUALIZATION OF DATA (BAR CHART, SCATTER PLOT & HEAT MAP)

BAR CHART

# Plot the count of each clone size rather than one bar per observation
barplot(table(df$clonesize),
        main = "BAR CHART",
        xlab = "Clone Size",
        ylab = "Count")

SCATTER PLOT

scatter.smooth(df$clonesize,
               main = "SCATTER PLOT")

HEAT MAP

data <- read.csv("wildblueberry.csv", header = TRUE)
data <- data.matrix(data[, -1])
heatmap(t(data),
        main = "HEAT MAP",
        Rowv = NA,
        Colv = NA,
        col = heat.colors(200, alpha = 1, rev = FALSE),
        scale = "row")

summary(df$seeds)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.3    32.3    36.0    35.6    39.1    46.9

CONCLUSION

This was my first time working in R Markdown, and I really liked the results, even though it took a lot of time to figure things out. I really enjoyed the project, and I felt like I could play with some data.