AAQ1

QUESTION 1

The data set used is quakes (the locations and magnitudes of earthquakes off Fiji since 1964).

head(quakes)

##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

The first question: is there a correlation between the number of stations reporting an earthquake and its magnitude?

plot(quakes$stations,quakes$mag,xlab = "stations detected",ylab = "magnitude(richter scale)")

Interesting: the scatter plot shows a clear positive correlation between the magnitude and the number of stations reporting. Earthquakes reported by more stations tend to have higher magnitudes.
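The visual impression can be backed up with a number. The lines below are a minimal sketch that computes the Pearson correlation coefficient with base R; the variable name r_mag_stations is only an illustrative choice.

```r
# Pearson correlation between the number of reporting stations and magnitude
r_mag_stations <- cor(quakes$stations, quakes$mag)
r_mag_stations

# cor.test() additionally reports a confidence interval and a p-value
cor.test(quakes$stations, quakes$mag)
```

A value close to +1 would agree with the upward trend seen in the scatter plot.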

Next, is there a correlation between the depth of an earthquake and its magnitude?

# depth is on the x-axis and magnitude on the y-axis, so the labels must match
plot(quakes$depth,quakes$mag,xlab = "depth(km)",ylab = "magnitude(richter scale)")

Hmm, there does not seem to be any clear correlation between the depth and the magnitude of an earthquake, which suggests that an earthquake’s depth does not affect its magnitude.

Next, we would like to see whether there is a correlation between the depth of an earthquake and the number of stations detecting it.

plot(quakes$depth,quakes$stations,xlab = "depth(km)",ylab = "stations detected")

Based on the scatter plot, there is no visible correlation between depth and the number of stations detecting.
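The three scatter plots can be cross-checked in one step with a correlation matrix. This is only a sketch using base R; vars and cor_matrix are illustrative names.

```r
# Keep the three variables that were plotted against each other
vars <- quakes[, c("depth", "mag", "stations")]

# Pairwise Pearson correlations; the diagonal is always 1
cor_matrix <- cor(vars)
round(cor_matrix, 2)
```

Entries near 0 correspond to the plots with no visible pattern, while an entry near +1 or -1 indicates a strong linear relationship.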

Next, we would like to know which earthquake magnitude occurs most often.

hist(quakes$mag,xlab= "magnitude(richter scale)",main = "histogram of magnitude")

This histogram answers two questions. First, it shows that a magnitude of about 4.5 is the most frequent, i.e. the mode of the graph. Second, it shows that lower-magnitude earthquakes (magnitude 4.0 - 4.75) happen more frequently than high-magnitude earthquakes (magnitude 5.0 and above).
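Reading the mode off a histogram depends on the bin width, so it can help to count the recorded magnitudes directly. The sketch below uses table() from base R; mag_counts and modal_mag are illustrative names.

```r
# Frequency of each distinct magnitude value
mag_counts <- table(quakes$mag)
mag_counts

# The magnitude with the highest count (the first one if there is a tie)
modal_mag <- as.numeric(names(mag_counts)[which.max(mag_counts)])
modal_mag
```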

Now that we have the mode of the magnitudes, we also want to know the median and the extreme values of the dataset. To visualise these we need a boxplot.

boxplot(quakes$mag)

The boxplot shows that there are a few outliers, the median is around 4.5, the minimum is at 4.0 and the maximum is at around 6.4.

The actual median is:-

median(quakes$mag)

## [1] 4.6

The actual maximum is:-

max(quakes$mag)

## [1] 6.4

The actual minimum is:-

min(quakes$mag)

## [1] 4

The actual interquartile range is:-

IQR(quakes$mag)

## [1] 0.6
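The points drawn as outliers in the boxplot can also be listed explicitly. The sketch below uses boxplot.stats(), the base R helper that computes the statistics a boxplot is drawn from; mag_stats and mag_outliers are illustrative names.

```r
# Five-number summary used by the boxplot:
# lower whisker, lower hinge, median, upper hinge, upper whisker
mag_stats <- boxplot.stats(quakes$mag)
mag_stats$stats

# Values lying beyond the whiskers (more than 1.5 * IQR from the box)
mag_outliers <- mag_stats$out
sort(unique(mag_outliers))
length(mag_outliers)
```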

Finally, we would like to know the mean of the magnitude. The mean is:-

mean(quakes$mag)

## [1] 4.6204

This means the average magnitude is 4.62.
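The separate calls above can also be collected in a single step. As a small sketch, summary() from base R returns the minimum, quartiles, median, mean and maximum together; mag_summary is an illustrative name.

```r
# Minimum, 1st quartile, median, mean, 3rd quartile and maximum in one call
mag_summary <- summary(quakes$mag)
mag_summary
```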

Now that we know a lot about the magnitude, it is time to analyse the depth column.

hist(quakes$depth,xlab= "depth(km)",main = "histogram of depth")

This histogram shows that the most common depth range is 50-100 km, so the modal bin is 50-100 km.
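The bin counts behind the histogram can be inspected without drawing it. The following is a sketch that relies on the default breaks chosen by hist(); depth_hist and modal_bin are illustrative names.

```r
# Compute the histogram without plotting it
depth_hist <- hist(quakes$depth, plot = FALSE)

# One row per bin: lower edge, upper edge and number of earthquakes
data.frame(
  from  = head(depth_hist$breaks, -1),
  to    = tail(depth_hist$breaks, -1),
  count = depth_hist$counts
)

# Edges of the bin with the highest count
modal_bin <- which.max(depth_hist$counts)
depth_hist$breaks[c(modal_bin, modal_bin + 1)]
```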

Now that we have the mode of the depth, we also want to know the median and the extreme values of the dataset. To visualise these we need a boxplot.

boxplot(quakes$depth)

The boxplot shows that there are no outliers, the median is at around 240, the minimum at around 45 and the maximum at around 670.

The actual median is:-

median(quakes$depth)

## [1] 247

The actual maximum is:-

max(quakes$depth)

## [1] 680

The actual minimum is:-

min(quakes$depth)

## [1] 40

Next, the actual interquartile range is:-

IQR(quakes$depth)

## [1] 444

Finally, we would like to know the mean of the depth. The mean is:-

mean(quakes$depth)

## [1] 311.371

This means the average depth is 311.37 km.

The codebook:-

# using the old method: codebook() comes from the memisc package
library(memisc)
codebook(quakes)

## ================================================================================
## 
##    lat
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min: -38.590
##         Max: -10.720
##        Mean: -20.643
##    Std.Dev.:   5.026
##    Skewness:  -0.676
##    Kurtosis:   0.738
## 
## ================================================================================
## 
##    long
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min: 165.670
##         Max: 188.130
##        Mean: 179.462
##    Std.Dev.:   6.066
##    Skewness:  -1.163
##    Kurtosis:   0.021
## 
## ================================================================================
## 
##    depth
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
## 
##         Min:  40.000
##         Max: 680.000
##        Mean: 311.371
##    Std.Dev.: 215.428
##    Skewness:   0.198
##    Kurtosis:  -1.595
## 
## ================================================================================
## 
##    mag
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min: 4.000
##         Max: 6.400
##        Mean: 4.620
##    Std.Dev.: 0.403
##    Skewness: 0.769
##    Kurtosis: 0.510
## 
## ================================================================================
## 
##    stations
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
## 
##         Min:  10.000
##         Max: 132.000
##        Mean:  33.418
##    Std.Dev.:  21.889
##    Skewness:   1.617
##    Kurtosis:   2.683

Codebook using the new method:-

 library(haven)

## Warning: package 'haven' was built under R version 4.1.2

 library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.2

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning: package 'tibble' was built under R version 4.1.2

## Warning: package 'tidyr' was built under R version 4.1.2

## Warning: package 'readr' was built under R version 4.1.2

## Warning: package 'purrr' was built under R version 4.1.2

## Warning: package 'dplyr' was built under R version 4.1.2

## Warning: package 'forcats' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x purrr::%@%()     masks memisc::%@%()
## x dplyr::collect() masks memisc::collect()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x dplyr::recode()  masks memisc::recode()
## x dplyr::rename()  masks memisc::rename()
## x dplyr::select()  masks MASS::select()
## x dplyr::syms()    masks ggplot2::syms(), memisc::syms()
## x tibble::view()   masks memisc::view()

library(summarytools)

## Warning: package 'summarytools' was built under R version 4.1.2

## 
## Attaching package: 'summarytools'

## The following object is masked from 'package:tibble':
## 
##     view

## The following object is masked from 'package:memisc':
## 
##     view

DF <- dplyr::select(quakes, stations, depth, mag, long, lat)

print(dfSummary(DF, graph.magnif = 0.75, method = 'render'))

## Data Frame Summary  
## DF  
## Dimensions: 1000 x 5  
## Duplicates: 0  
## 
## -------------------------------------------------------------------------------------------------------------
## No   Variable    Stats / Values              Freqs (% of Valid)    Graph                 Valid      Missing  
## ---- ----------- --------------------------- --------------------- --------------------- ---------- ---------
## 1    stations    Mean (sd) : 33.4 (21.9)     102 distinct values   :                     1000       0        
##      [integer]   min < med < max:                                  :                     (100.0%)   (0.0%)   
##                  10 < 27 < 132                                     : :                                       
##                  IQR (CV) : 24 (0.7)                               : : .                                     
##                                                                    : : : : . . .                             
## 
## 2    depth       Mean (sd) : 311.4 (215.5)   422 distinct values   :                     1000       0        
##      [integer]   min < med < max:                                  :                     (100.0%)   (0.0%)   
##                  40 < 247 < 680                                    :               :                         
##                  IQR (CV) : 444 (0.7)                              : : :         : :                         
##                                                                    : : : : . : . : : :                       
## 
## 3    mag         Mean (sd) : 4.6 (0.4)       22 distinct values        :                 1000       0        
##      [numeric]   min < med < max:                                      :                 (100.0%)   (0.0%)   
##                  4 < 4.6 < 6.4                                     : : :                                     
##                  IQR (CV) : 0.6 (0.1)                              : : : : :                                 
##                                                                    : : : : : . .                             
## 
## 4    long        Mean (sd) : 179.5 (6.1)     605 distinct values               . :       1000       0        
##      [numeric]   min < med < max:                                              : :       (100.0%)   (0.0%)   
##                  165.7 < 181.4 < 188.1                                         : : .                         
##                  IQR (CV) : 3.6 (0)                                :           : : :                         
##                                                                    : : .     . : : : :                       
## 
## 5    lat         Mean (sd) : -20.6 (5)       721 distinct values               :         1000       0        
##      [numeric]   min < med < max:                                            . : :       (100.0%)   (0.0%)   
##                  -38.6 < -20.3 < -10.7                                       : : : .                         
##                  IQR (CV) : 5.8 (-0.2)                                     . : : : : .                       
##                                                                      . . : : : : : : :                       
## -------------------------------------------------------------------------------------------------------------
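The two codebooks report slightly different standard deviations for depth (215.428 in the first, 215.5 in the second). One possible explanation, stated here as an assumption that the output itself does not confirm, is that one figure divides by n and the other by n - 1. The sketch below computes both versions with base R so they can be compared; sd_sample and sd_population are illustrative names.

```r
n <- length(quakes$depth)

# sd() in base R uses the sample formula (denominator n - 1)
sd_sample <- sd(quakes$depth)

# Rescale to the population formula (denominator n)
sd_population <- sd_sample * sqrt((n - 1) / n)

round(c(sample = sd_sample, population = sd_population), 3)
```

With 1000 observations the two versions differ only slightly, so either is fine for describing the spread of the data.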

QUESTION 2

The dataset used is USArrests.

##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8
## Hawaii         5.3      46       83 20.2
## Idaho          2.6     120       54 14.2
## Illinois      10.4     249       83 24.0
## Indiana        7.2     113       65 21.0
## Iowa           2.2      56       57 11.3

i)filter()

filter() keeps only the rows of a dataset that satisfy a condition. For example, here I want to keep only the states with more than 200 Assault arrests.

library(dplyr)
filter(USArrests,Assault>200)

##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Illinois         10.4     249       83 24.0
## Louisiana        15.4     249       66 22.2
## Maryland         11.3     300       67 27.8
## Michigan         12.1     255       74 35.1
## Mississippi      16.1     259       44 17.1
## Nevada           12.2     252       81 46.0
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## South Carolina   14.4     279       48 22.5
## Texas            12.7     201       80 25.5
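filter() also accepts several conditions at once, and a row is kept only if all of them are true. As a sketch, the lines below keep the states with more than 200 Assault arrests that also have an UrbanPop of at least 80; the threshold of 80 and the name high_assault_urban are chosen only for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with a logical AND
high_assault_urban <- dplyr::filter(USArrests, Assault > 200, UrbanPop >= 80)
high_assault_urban
```

Writing dplyr::filter() explicitly avoids picking up stats::filter() if dplyr is not the most recently attached package.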

ii)arrange()

arrange() sorts a dataset. For example, here I want to sort the states from the lowest number of Assault arrests to the highest.

library(dplyr)
arrange(USArrests,Assault)

##                Murder Assault UrbanPop Rape
## North Dakota      0.8      45       44  7.3
## Hawaii            5.3      46       83 20.2
## Vermont           2.2      48       32 11.2
## Wisconsin         2.6      53       66 10.8
## Iowa              2.2      56       57 11.3
## New Hampshire     2.1      57       56  9.5
## Minnesota         2.7      72       66 14.9
## West Virginia     5.7      81       39  9.3
## Maine             2.1      83       51  7.8
## South Dakota      3.8      86       45 12.8
## Nebraska          4.3     102       62 16.5
## Pennsylvania      6.3     106       72 14.9
## Kentucky          9.7     109       52 16.3
## Montana           6.0     109       53 16.4
## Connecticut       3.3     110       77 11.1
## Indiana           7.2     113       65 21.0
## Kansas            6.0     115       66 18.0
## Idaho             2.6     120       54 14.2
## Ohio              7.3     120       75 21.4
## Utah              3.2     120       80 22.9
## Washington        4.0     145       73 26.2
## Massachusetts     4.4     149       85 16.3
## Oklahoma          6.6     151       68 20.0
## Virginia          8.5     156       63 20.7
## New Jersey        7.4     159       89 18.8
## Oregon            4.9     159       67 29.3
## Wyoming           6.8     161       60 15.6
## Rhode Island      3.4     174       87  8.3
## Missouri          9.0     178       70 28.2
## Tennessee        13.2     188       59 26.9
## Arkansas          8.8     190       50 19.5
## Texas            12.7     201       80 25.5
## Colorado          7.9     204       78 38.7
## Georgia          17.4     211       60 25.8
## Alabama          13.2     236       58 21.2
## Delaware          5.9     238       72 15.8
## Illinois         10.4     249       83 24.0
## Louisiana        15.4     249       66 22.2
## Nevada           12.2     252       81 46.0
## New York         11.1     254       86 26.1
## Michigan         12.1     255       74 35.1
## Mississippi      16.1     259       44 17.1
## Alaska           10.0     263       48 44.5
## California        9.0     276       91 40.6
## South Carolina   14.4     279       48 22.5
## New Mexico       11.4     285       70 32.1
## Arizona           8.1     294       80 31.0
## Maryland         11.3     300       67 27.8
## Florida          15.4     335       80 31.9
## North Carolina   13.0     337       45 16.1
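The reverse order is obtained by wrapping the column in desc(). The sketch below sorts from the highest number of Assault arrests to the lowest and shows the first three rows; by_assault_desc is an illustrative name.

```r
library(dplyr)

# desc() reverses the sort order for that column
by_assault_desc <- arrange(USArrests, desc(Assault))
head(by_assault_desc, 3)
```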

iii)mutate()

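mutate() adds new columns to a dataset or changes existing ones. For example, here I want to add another column named TotalArrests, computed as the sum of the four columns in each state's row. Note that UrbanPop is the percentage of the population living in urban areas, not an arrest rate, so TotalArrests as computed here is not purely a total of arrests.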

library(dplyr)
data1 <- mutate(USArrests, TotalArrests = Assault+Rape+UrbanPop+Murder)
data1

##                Murder Assault UrbanPop Rape TotalArrests
## Alabama          13.2     236       58 21.2        328.4
## Alaska           10.0     263       48 44.5        365.5
## Arizona           8.1     294       80 31.0        413.1
## Arkansas          8.8     190       50 19.5        268.3
## California        9.0     276       91 40.6        416.6
## Colorado          7.9     204       78 38.7        328.6
## Connecticut       3.3     110       77 11.1        201.4
## Delaware          5.9     238       72 15.8        331.7
## Florida          15.4     335       80 31.9        462.3
## Georgia          17.4     211       60 25.8        314.2
## Hawaii            5.3      46       83 20.2        154.5
## Idaho             2.6     120       54 14.2        190.8
## Illinois         10.4     249       83 24.0        366.4
## Indiana           7.2     113       65 21.0        206.2
## Iowa              2.2      56       57 11.3        126.5
## Kansas            6.0     115       66 18.0        205.0
## Kentucky          9.7     109       52 16.3        187.0
## Louisiana        15.4     249       66 22.2        352.6
## Maine             2.1      83       51  7.8        143.9
## Maryland         11.3     300       67 27.8        406.1
## Massachusetts     4.4     149       85 16.3        254.7
## Michigan         12.1     255       74 35.1        376.2
## Minnesota         2.7      72       66 14.9        155.6
## Mississippi      16.1     259       44 17.1        336.2
## Missouri          9.0     178       70 28.2        285.2
## Montana           6.0     109       53 16.4        184.4
## Nebraska          4.3     102       62 16.5        184.8
## Nevada           12.2     252       81 46.0        391.2
## New Hampshire     2.1      57       56  9.5        124.6
## New Jersey        7.4     159       89 18.8        274.2
## New Mexico       11.4     285       70 32.1        398.5
## New York         11.1     254       86 26.1        377.2
## North Carolina   13.0     337       45 16.1        411.1
## North Dakota      0.8      45       44  7.3         97.1
## Ohio              7.3     120       75 21.4        223.7
## Oklahoma          6.6     151       68 20.0        245.6
## Oregon            4.9     159       67 29.3        260.2
## Pennsylvania      6.3     106       72 14.9        199.2
## Rhode Island      3.4     174       87  8.3        272.7
## South Carolina   14.4     279       48 22.5        363.9
## South Dakota      3.8      86       45 12.8        147.6
## Tennessee        13.2     188       59 26.9        287.1
## Texas            12.7     201       80 25.5        319.2
## Utah              3.2     120       80 22.9        226.1
## Vermont           2.2      48       32 11.2         93.4
## Virginia          8.5     156       63 20.7        248.2
## Washington        4.0     145       73 26.2        248.2
## West Virginia     5.7      81       39  9.3        135.0
## Wisconsin         2.6      53       66 10.8        132.4
## Wyoming           6.8     161       60 15.6        243.4
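Because UrbanPop is a percentage of the population and not an arrest figure, a total that covers only arrests would leave it out. The lines below are a sketch of that variant; the column name CrimeArrests and the object crime_totals are hypothetical names used only for this illustration.

```r
library(dplyr)

# Sum only the three arrest columns; UrbanPop is deliberately excluded
crime_totals <- mutate(USArrests, CrimeArrests = Murder + Assault + Rape)
head(crime_totals)
```

For Alabama this gives 13.2 + 236 + 21.2 = 270.4.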

iv)select()

select() picks certain columns of a dataset. For example, here I want to pick only the TotalArrests column.

library(dplyr)
select(data1,TotalArrests)

##                TotalArrests
## Alabama               328.4
## Alaska                365.5
## Arizona               413.1
## Arkansas              268.3
## California            416.6
## Colorado              328.6
## Connecticut           201.4
## Delaware              331.7
## Florida               462.3
## Georgia               314.2
## Hawaii                154.5
## Idaho                 190.8
## Illinois              366.4
## Indiana               206.2
## Iowa                  126.5
## Kansas                205.0
## Kentucky              187.0
## Louisiana             352.6
## Maine                 143.9
## Maryland              406.1
## Massachusetts         254.7
## Michigan              376.2
## Minnesota             155.6
## Mississippi           336.2
## Missouri              285.2
## Montana               184.4
## Nebraska              184.8
## Nevada                391.2
## New Hampshire         124.6
## New Jersey            274.2
## New Mexico            398.5
## New York              377.2
## North Carolina        411.1
## North Dakota           97.1
## Ohio                  223.7
## Oklahoma              245.6
## Oregon                260.2
## Pennsylvania          199.2
## Rhode Island          272.7
## South Carolina        363.9
## South Dakota          147.6
## Tennessee             287.1
## Texas                 319.2
## Utah                  226.1
## Vermont                93.4
## Virginia              248.2
## Washington            248.2
## West Virginia         135.0
## Wisconsin             132.4
## Wyoming               243.4
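select() can also pick several columns at once or drop a column with a minus sign. The sketch below works on the original USArrests data; murder_rape and crime_only are illustrative names.

```r
library(dplyr)

# Keep two named columns
murder_rape <- dplyr::select(USArrests, Murder, Rape)
head(murder_rape)

# Keep every column except UrbanPop
crime_only <- dplyr::select(USArrests, -UrbanPop)
head(crime_only)
```

The dplyr:: prefix matters when MASS is also attached, because MASS has its own select() function.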

v)summarise()

summarise() computes a summary of a dataset. For example, here I want to see the mean of the TotalArrests column.

library(dplyr)
summarise(data1,Totalmean=mean(TotalArrests))

##   Totalmean
## 1    265.32
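The verbs above can be chained with the pipe operator %>%, and summarise() can return several statistics in one row. The following is a sketch that summarises only the states with more than 200 Assault arrests; assault_summary and its column names are illustrative.

```r
library(dplyr)

# Filter first, then reduce the remaining rows to a one-row summary
assault_summary <- USArrests %>%
  filter(Assault > 200) %>%
  summarise(
    n_states     = n(),           # number of states that passed the filter
    mean_assault = mean(Assault), # average Assault value among them
    max_assault  = max(Assault)   # highest Assault value among them
  )
assault_summary
```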