QUESTION 1
The data set used is quakes (the locations and magnitudes of earthquakes near Fiji since 1964).
head(quakes)
## lat long depth mag stations
## 1 -20.42 181.62 562 4.8 41
## 2 -20.62 181.03 650 4.2 15
## 3 -26.00 184.10 42 5.4 43
## 4 -17.97 181.66 626 4.1 19
## 5 -20.42 181.96 649 4.0 11
## 6 -19.68 184.31 195 4.0 12
The first question is whether there is a correlation between the number of stations reporting an earthquake and its magnitude.
plot(quakes$stations,quakes$mag,xlab = "stations detected",ylab = "magnitude(richter scale)")
Based on the scatter plot, there is a clear positive correlation between the magnitude and the number of stations detecting the earthquake: the higher the magnitude, the more stations report it.
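The strength of this relationship can also be put into a single number. A minimal sketch, assuming Pearson's correlation coefficient is an acceptable measure for these two columns (the variable name r_stations_mag is only illustrative):

```r
# Pearson correlation between the number of reporting stations and magnitude
r_stations_mag <- cor(quakes$stations, quakes$mag)
r_stations_mag
```

A value close to 1 would support the positive relationship seen in the scatter plot.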
Next, is there a correlation between the depth of an earthquake and its magnitude?
plot(quakes$depth,quakes$mag,xlab ="depth(km)",ylab ="magnitude(richter scale)" )
There does not appear to be any strong correlation between the depth and the magnitude of an earthquake, which suggests that depth has little effect on magnitude. Next, we would like to see whether there is a correlation between the depth of an earthquake and the number of stations detecting it.
plot(quakes$depth,quakes$stations,xlab ="depth(km)",ylab ="stations detected" )
Based on the scatter plot, there is no apparent correlation.
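The two depth comparisons can be checked numerically as well. This is a sketch assuming Pearson correlation is a reasonable summary here; r_depth is a hypothetical name:

```r
# Correlation matrix for the three columns compared in the scatter plots
r_depth <- cor(quakes[, c("depth", "mag", "stations")])

# Correlation of depth with magnitude and with stations, rounded for reading
round(r_depth["depth", c("mag", "stations")], 2)
```

Values near 0 would agree with the scatter plots, while a clearly non-zero value would hint at a weak relationship that is hard to see by eye.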
Next, we would like to know which earthquake magnitudes occur most often.
hist(quakes$mag,xlab= "magnitude(richter scale)",main = "histogram of magnitude")
This histogram answers two questions. First, it shows that earthquakes of around magnitude 4.5 are the most frequent, i.e. the mode of the graph. Second, it shows that lower-magnitude earthquakes (magnitude 4.0 - 4.75) happen more frequently than higher-magnitude earthquakes (magnitude 5.0 and above).
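A histogram only shows the mode to the width of its bins. As a sketch of how the exact most frequent value could be found (mag_counts and mode_mag are illustrative names, not part of the original analysis):

```r
# Count how many earthquakes were recorded at each distinct magnitude
mag_counts <- table(quakes$mag)

# The magnitude with the largest count is the mode
mode_mag <- as.numeric(names(mag_counts)[which.max(mag_counts)])
mode_mag

# Proportion of earthquakes with magnitude below 5.0
mean(quakes$mag < 5)
```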
Now that we have the mode of the magnitudes, we also want to know the median and the extreme values of the dataset. To visualise these we need a boxplot.
boxplot(quakes$mag)
The boxplot shows that there are a few outliers, the median is around 4.5, the minimum is at 4.0 and the maximum is at around 6.4.
The actual median is:-
median(quakes$mag)
## [1] 4.6
The actual maximum is:-
max(quakes$mag)
## [1] 6.4
The actual minimum is:-
min(quakes$mag)
## [1] 4
The actual interquartile range is:-
IQR(quakes$mag)
## [1] 0.6
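The points drawn as outliers in a boxplot are those beyond 1.5 times the interquartile range from the box. A rough sketch of how they could be listed, assuming the default boxplot settings (the variable names are illustrative; boxplot.stats() uses hinges, which may differ slightly from quantile()):

```r
# Approximate fences using the quartiles and the IQR
q <- quantile(quakes$mag, c(0.25, 0.75))
lower_fence <- q[[1]] - 1.5 * IQR(quakes$mag)
upper_fence <- q[[2]] + 1.5 * IQR(quakes$mag)

# The values that boxplot() itself would draw as outliers
mag_outliers <- boxplot.stats(quakes$mag)$out
length(mag_outliers)
range(mag_outliers)
```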
Finally, we would like to know the mean of the magnitude. The mean is:-
mean(quakes$mag)
## [1] 4.6204
This means the average magnitude is 4.62.
Now that we know a lot about the magnitude, it is time to analyse the depth column.
hist(quakes$depth,xlab= "depth(km)",main = "histogram of depth")
This histogram shows that most earthquakes happen at a depth of 50-100 km, so the modal class is 50-100 km.
Now that we have the mode of the depth, we also want to know the median and the extreme values of the dataset. To visualise these we need a boxplot.
boxplot(quakes$depth)
The boxplot shows that there are no outliers, the median is at around 240, the minimum is at around 45 and the maximum is at around 670.
The actual median is:-
median(quakes$depth)
## [1] 247
The actual maximum is:-
max(quakes$depth)
## [1] 680
The actual minimum is:-
min(quakes$depth)
## [1] 40
Next, the actual interquartile range is:-
IQR(quakes$depth)
## [1] 444
Finally, we would like to know the mean of the depth. The mean is:-
mean(quakes$depth)
## [1] 311.371
This means the average depth is 311.37 km.
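The separate calls above can also be collapsed into one. A minimal sketch using base R's summary(), which reports the minimum, quartiles, median, mean and maximum together (depth_summary is an illustrative name):

```r
# Five-number summary plus the mean for the depth column
depth_summary <- summary(quakes$depth)
depth_summary
```

The same call works for quakes$mag or for the whole data frame with summary(quakes).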
The codebook:-
# using the old method; codebook() comes from the memisc package
library(memisc)
codebook(quakes)
## ================================================================================
##
## lat
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: -38.590
## Max: -10.720
## Mean: -20.643
## Std.Dev.: 5.026
## Skewness: -0.676
## Kurtosis: 0.738
##
## ================================================================================
##
## long
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 165.670
## Max: 188.130
## Mean: 179.462
## Std.Dev.: 6.066
## Skewness: -1.163
## Kurtosis: 0.021
##
## ================================================================================
##
## depth
##
## --------------------------------------------------------------------------------
##
## Storage mode: integer
##
## Min: 40.000
## Max: 680.000
## Mean: 311.371
## Std.Dev.: 215.428
## Skewness: 0.198
## Kurtosis: -1.595
##
## ================================================================================
##
## mag
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 4.000
## Max: 6.400
## Mean: 4.620
## Std.Dev.: 0.403
## Skewness: 0.769
## Kurtosis: 0.510
##
## ================================================================================
##
## stations
##
## --------------------------------------------------------------------------------
##
## Storage mode: integer
##
## Min: 10.000
## Max: 132.000
## Mean: 33.418
## Std.Dev.: 21.889
## Skewness: 1.617
## Kurtosis: 2.683
The codebook using the new method:-
library(haven)
## Warning: package 'haven' was built under R version 4.1.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'tibble' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x purrr::%@%() masks memisc::%@%()
## x dplyr::collect() masks memisc::collect()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::recode() masks memisc::recode()
## x dplyr::rename() masks memisc::rename()
## x dplyr::select() masks MASS::select()
## x dplyr::syms() masks ggplot2::syms(), memisc::syms()
## x tibble::view() masks memisc::view()
library(summarytools)
## Warning: package 'summarytools' was built under R version 4.1.2
##
## Attaching package: 'summarytools'
## The following object is masked from 'package:tibble':
##
## view
## The following object is masked from 'package:memisc':
##
## view
DF <- dplyr::select(quakes,stations,depth,mag,long,lat)
print(dfSummary(DF, graph.magnif =.75, method='render'))
## Data Frame Summary
## DF
## Dimensions: 1000 x 5
## Duplicates: 0
##
## -------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ----------- --------------------------- --------------------- --------------------- ---------- ---------
## 1 stations Mean (sd) : 33.4 (21.9) 102 distinct values : 1000 0
## [integer] min < med < max: : (100.0%) (0.0%)
## 10 < 27 < 132 : :
## IQR (CV) : 24 (0.7) : : .
## : : : : . . .
##
## 2 depth Mean (sd) : 311.4 (215.5) 422 distinct values : 1000 0
## [integer] min < med < max: : (100.0%) (0.0%)
## 40 < 247 < 680 : :
## IQR (CV) : 444 (0.7) : : : : :
## : : : : . : . : : :
##
## 3 mag Mean (sd) : 4.6 (0.4) 22 distinct values : 1000 0
## [numeric] min < med < max: : (100.0%) (0.0%)
## 4 < 4.6 < 6.4 : : :
## IQR (CV) : 0.6 (0.1) : : : : :
## : : : : : . .
##
## 4 long Mean (sd) : 179.5 (6.1) 605 distinct values . : 1000 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 165.7 < 181.4 < 188.1 : : .
## IQR (CV) : 3.6 (0) : : : :
## : : . . : : : :
##
## 5 lat Mean (sd) : -20.6 (5) 721 distinct values : 1000 0
## [numeric] min < med < max: . : : (100.0%) (0.0%)
## -38.6 < -20.3 < -10.7 : : : .
## IQR (CV) : 5.8 (-0.2) . : : : : .
## . . : : : : : : :
## -------------------------------------------------------------------------------------------------------------
QUESTION 2
The dataset used is USArrests.
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## Connecticut 3.3 110 77 11.1
## Delaware 5.9 238 72 15.8
## Florida 15.4 335 80 31.9
## Georgia 17.4 211 60 25.8
## Hawaii 5.3 46 83 20.2
## Idaho 2.6 120 54 14.2
## Illinois 10.4 249 83 24.0
## Indiana 7.2 113 65 21.0
## Iowa 2.2 56 57 11.3
filter() keeps only the rows of a dataset that meet a condition. For example, here I want to filter the dataset so that only states with more than 200 Assault arrests remain.
library(dplyr)
filter(USArrests,Assault>200)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## Delaware 5.9 238 72 15.8
## Florida 15.4 335 80 31.9
## Georgia 17.4 211 60 25.8
## Illinois 10.4 249 83 24.0
## Louisiana 15.4 249 66 22.2
## Maryland 11.3 300 67 27.8
## Michigan 12.1 255 74 35.1
## Mississippi 16.1 259 44 17.1
## Nevada 12.2 252 81 46.0
## New Mexico 11.4 285 70 32.1
## New York 11.1 254 86 26.1
## North Carolina 13.0 337 45 16.1
## South Carolina 14.4 279 48 22.5
## Texas 12.7 201 80 25.5
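filter() also accepts more than one condition. The sketch below is only an illustration; the threshold Murder > 15 and the names high_assault and high_both are assumptions chosen for the example:

```r
library(dplyr)

# Same filter as above, stored so the rows can be counted
high_assault <- filter(USArrests, Assault > 200)
nrow(high_assault)

# Conditions separated by commas are combined with AND
high_both <- filter(USArrests, Assault > 200, Murder > 15)
high_both
```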
arrange() sorts a dataset. For example, here I want to sort the data from the state with the lowest number of Assault arrests to the state with the highest.
library(dplyr)
arrange(USArrests,Assault)
## Murder Assault UrbanPop Rape
## North Dakota 0.8 45 44 7.3
## Hawaii 5.3 46 83 20.2
## Vermont 2.2 48 32 11.2
## Wisconsin 2.6 53 66 10.8
## Iowa 2.2 56 57 11.3
## New Hampshire 2.1 57 56 9.5
## Minnesota 2.7 72 66 14.9
## West Virginia 5.7 81 39 9.3
## Maine 2.1 83 51 7.8
## South Dakota 3.8 86 45 12.8
## Nebraska 4.3 102 62 16.5
## Pennsylvania 6.3 106 72 14.9
## Kentucky 9.7 109 52 16.3
## Montana 6.0 109 53 16.4
## Connecticut 3.3 110 77 11.1
## Indiana 7.2 113 65 21.0
## Kansas 6.0 115 66 18.0
## Idaho 2.6 120 54 14.2
## Ohio 7.3 120 75 21.4
## Utah 3.2 120 80 22.9
## Washington 4.0 145 73 26.2
## Massachusetts 4.4 149 85 16.3
## Oklahoma 6.6 151 68 20.0
## Virginia 8.5 156 63 20.7
## New Jersey 7.4 159 89 18.8
## Oregon 4.9 159 67 29.3
## Wyoming 6.8 161 60 15.6
## Rhode Island 3.4 174 87 8.3
## Missouri 9.0 178 70 28.2
## Tennessee 13.2 188 59 26.9
## Arkansas 8.8 190 50 19.5
## Texas 12.7 201 80 25.5
## Colorado 7.9 204 78 38.7
## Georgia 17.4 211 60 25.8
## Alabama 13.2 236 58 21.2
## Delaware 5.9 238 72 15.8
## Illinois 10.4 249 83 24.0
## Louisiana 15.4 249 66 22.2
## Nevada 12.2 252 81 46.0
## New York 11.1 254 86 26.1
## Michigan 12.1 255 74 35.1
## Mississippi 16.1 259 44 17.1
## Alaska 10.0 263 48 44.5
## California 9.0 276 91 40.6
## South Carolina 14.4 279 48 22.5
## New Mexico 11.4 285 70 32.1
## Arizona 8.1 294 80 31.0
## Maryland 11.3 300 67 27.8
## Florida 15.4 335 80 31.9
## North Carolina 13.0 337 45 16.1
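To sort in the opposite direction, arrange() can be combined with desc(). A short sketch (by_assault_desc is an illustrative name):

```r
library(dplyr)

# Sort from the highest number of Assault arrests to the lowest
by_assault_desc <- arrange(USArrests, desc(Assault))

# Show only the top three states
head(by_assault_desc, 3)
```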
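mutate() adds or changes columns in a dataset. For example, here I want to add another column named TotalArrests, computed as the sum of the Murder, Assault, UrbanPop and Rape columns for each state. Note that UrbanPop is the percentage of the population living in urban areas rather than an arrest rate, so this total is not purely an arrest figure.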
library(dplyr)
data1 <- mutate(USArrests, TotalArrests = Assault+Rape+UrbanPop+Murder)
data1
## Murder Assault UrbanPop Rape TotalArrests
## Alabama 13.2 236 58 21.2 328.4
## Alaska 10.0 263 48 44.5 365.5
## Arizona 8.1 294 80 31.0 413.1
## Arkansas 8.8 190 50 19.5 268.3
## California 9.0 276 91 40.6 416.6
## Colorado 7.9 204 78 38.7 328.6
## Connecticut 3.3 110 77 11.1 201.4
## Delaware 5.9 238 72 15.8 331.7
## Florida 15.4 335 80 31.9 462.3
## Georgia 17.4 211 60 25.8 314.2
## Hawaii 5.3 46 83 20.2 154.5
## Idaho 2.6 120 54 14.2 190.8
## Illinois 10.4 249 83 24.0 366.4
## Indiana 7.2 113 65 21.0 206.2
## Iowa 2.2 56 57 11.3 126.5
## Kansas 6.0 115 66 18.0 205.0
## Kentucky 9.7 109 52 16.3 187.0
## Louisiana 15.4 249 66 22.2 352.6
## Maine 2.1 83 51 7.8 143.9
## Maryland 11.3 300 67 27.8 406.1
## Massachusetts 4.4 149 85 16.3 254.7
## Michigan 12.1 255 74 35.1 376.2
## Minnesota 2.7 72 66 14.9 155.6
## Mississippi 16.1 259 44 17.1 336.2
## Missouri 9.0 178 70 28.2 285.2
## Montana 6.0 109 53 16.4 184.4
## Nebraska 4.3 102 62 16.5 184.8
## Nevada 12.2 252 81 46.0 391.2
## New Hampshire 2.1 57 56 9.5 124.6
## New Jersey 7.4 159 89 18.8 274.2
## New Mexico 11.4 285 70 32.1 398.5
## New York 11.1 254 86 26.1 377.2
## North Carolina 13.0 337 45 16.1 411.1
## North Dakota 0.8 45 44 7.3 97.1
## Ohio 7.3 120 75 21.4 223.7
## Oklahoma 6.6 151 68 20.0 245.6
## Oregon 4.9 159 67 29.3 260.2
## Pennsylvania 6.3 106 72 14.9 199.2
## Rhode Island 3.4 174 87 8.3 272.7
## South Carolina 14.4 279 48 22.5 363.9
## South Dakota 3.8 86 45 12.8 147.6
## Tennessee 13.2 188 59 26.9 287.1
## Texas 12.7 201 80 25.5 319.2
## Utah 3.2 120 80 22.9 226.1
## Vermont 2.2 48 32 11.2 93.4
## Virginia 8.5 156 63 20.7 248.2
## Washington 4.0 145 73 26.2 248.2
## West Virginia 5.7 81 39 9.3 135.0
## Wisconsin 2.6 53 66 10.8 132.4
## Wyoming 6.8 161 60 15.6 243.4
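If the intention is a total of arrest rates only, one possible variant leaves UrbanPop out of the sum. This is a sketch under that assumption; data2 and CrimeArrests are hypothetical names, and the figures differ from the TotalArrests column shown above:

```r
library(dplyr)

# Sum only the three arrest-rate columns (arrests per 100,000 residents)
data2 <- mutate(USArrests, CrimeArrests = Murder + Assault + Rape)
head(data2, 3)
```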
select() picks certain columns of a dataset. For example, here I want to pick only the TotalArrests column.
library(dplyr)
select(data1,TotalArrests)
## TotalArrests
## Alabama 328.4
## Alaska 365.5
## Arizona 413.1
## Arkansas 268.3
## California 416.6
## Colorado 328.6
## Connecticut 201.4
## Delaware 331.7
## Florida 462.3
## Georgia 314.2
## Hawaii 154.5
## Idaho 190.8
## Illinois 366.4
## Indiana 206.2
## Iowa 126.5
## Kansas 205.0
## Kentucky 187.0
## Louisiana 352.6
## Maine 143.9
## Maryland 406.1
## Massachusetts 254.7
## Michigan 376.2
## Minnesota 155.6
## Mississippi 336.2
## Missouri 285.2
## Montana 184.4
## Nebraska 184.8
## Nevada 391.2
## New Hampshire 124.6
## New Jersey 274.2
## New Mexico 398.5
## New York 377.2
## North Carolina 411.1
## North Dakota 97.1
## Ohio 223.7
## Oklahoma 245.6
## Oregon 260.2
## Pennsylvania 199.2
## Rhode Island 272.7
## South Carolina 363.9
## South Dakota 147.6
## Tennessee 287.1
## Texas 319.2
## Utah 226.1
## Vermont 93.4
## Virginia 248.2
## Washington 248.2
## West Virginia 135.0
## Wisconsin 132.4
## Wyoming 243.4
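select() can also drop columns or take a range of them. A brief sketch using the original USArrests data (crimes_only and first_two are illustrative names; dplyr::select is written out in full because other loaded packages such as MASS may define their own select()):

```r
library(dplyr)

# Drop one column by putting a minus sign in front of its name
crimes_only <- dplyr::select(USArrests, -UrbanPop)
names(crimes_only)

# Keep a run of neighbouring columns with the colon operator
first_two <- dplyr::select(USArrests, Murder:Assault)
names(first_two)
```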
summarise() collapses a dataset into summary values. For example, here I want to see the mean of the TotalArrests column.
library(dplyr)
summarise(data1,Totalmean=mean(TotalArrests))
## Totalmean
## 1 265.32
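The verbs can also be chained with the pipe operator %>% so that each step feeds the next. The following is a sketch rather than part of the original analysis; result and the summary column names are illustrative:

```r
library(dplyr)

# Keep the high-assault states, then summarise them in one pipeline
result <- USArrests %>%
  filter(Assault > 200) %>%
  summarise(
    n_states     = n(),           # number of states left after filtering
    mean_assault = mean(Assault), # average Assault arrest rate among them
    max_assault  = max(Assault)   # highest Assault arrest rate
  )
result
```

Writing the steps as a pipeline avoids creating an intermediate data frame for every verb.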