DA_DIS#3

Part 1
- Atomic Classes:
  - Logical: Takes one of three values: TRUE, FALSE, or NA.
  - Integer: Contains whole numbers, written in R with an L suffix, such as 1L, 2L, 3L.
  - Numeric: Contains real numbers stored as doubles, with or without a decimal part, such as 1.23, 2.56, 20, 98.7.
  - Character: Contains text strings such as “a”, “discussion”, “Weekend”.
- Other Data Structures:
  - Vector: In R, an atomic vector holds elements of a single type. The common types are logical (TRUE, FALSE, NA), numeric (integer and double), and character. Vectors are created with c(), such as c(1, 2, 3, 4) for a numeric vector or c("Pin", "Professor", "Data_Analysis", "Discussion") for a character vector.
  - List: It can contain elements of different types, such as numbers, strings, and vectors. A list can also contain other lists. An example would be x <- list(20, "a", c(1, 2, 3)), where x[[1]] returns 20.
  - Matrix: A two-dimensional structure in which all elements have the same data type. It can hold numeric, character, or logical values.
  - Data frame: A two-dimensional structure that resembles a spreadsheet. Unlike a matrix, each column can have a different data type, although all columns must have the same length. It’s the de facto data structure for most tabular data that we use in R or use for statistics.
  - Factors: Used to store categorical data such as month, sex, or age group. The levels can be ordered or unordered.
  - Tables: Created with table(), a table stores the counts for each category (or combination of categories) and prints much like a data frame. Tables are often essential for organizing and summarizing your data, especially with categorical variables.
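
The classes and structures listed above can be inspected directly in the R console. The following is a minimal sketch; the object names (`v_num`, `df`, `size`, and so on) are made up for illustration and do not come from the assignment.

```r
# Atomic classes: class() reports the class of each value
class(TRUE)        # "logical"
class(3L)          # "integer" (the L suffix requests an integer)
class(1.23)        # "numeric"
class("Weekend")   # "character"

# Vector: c() combines values of a single type
v_num <- c(1, 2, 3, 4)
v_chr <- c("Pin", "Professor", "Data_Analysis", "Discussion")

# List: elements may have different types, including other lists
x <- list(20, "a", c(TRUE, NA), list(1, 2))

# Matrix: two dimensions, one type
m <- matrix(1:6, nrow = 2)

# Data frame: columns of equal length, possibly of different types
df <- data.frame(name = c("A", "B", "C"), score = c(90.5, 72, 88))

# Factor: categorical data, here with ordered levels
size <- factor(c("low", "high", "low"), levels = c("low", "high"), ordered = TRUE)

# Table: counts per category
counts <- table(size)
counts
```

Calling `str()` on any of these objects prints a compact summary of its type and contents, which is a quick way to check which structure you are working with.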

Part 2

# Generate random vector with 7 elements
v <- sample(1:100,7)
v

## [1] 43 70 87 81 56 57 45

# Calculate the standard deviation using the built-in command
R_StandardDeviation_InBuilt <- sd(v)
R_StandardDeviation_InBuilt

## [1] 17.11446

# Calculate the standard deviation by hand
sqrt(sum((v-mean(v))^2/(length(v)-1)))

## [1] 17.11446

From the output above, we can see that the standard deviation calculated by hand (the square root of the sum of squared deviations from the mean, divided by n - 1) gives the same result as R's built-in sd() command.
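
The hand calculation can also be wrapped in a small function and checked against `sd()`. This is only a sketch: `sd_by_hand` is a name chosen here for illustration, and the vector is typed in from the printed output above so that the check is reproducible (the original `sample()` call sets no seed, so a rerun would draw different numbers).

```r
# Sample standard deviation: sqrt( sum((x - mean(x))^2) / (n - 1) )
sd_by_hand <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - 1))
}

# The seven values printed above
v <- c(43, 70, 87, 81, 56, 57, 45)

sd_by_hand(v)

# all.equal() compares the two results within a small numerical tolerance
all.equal(sd_by_hand(v), sd(v))
```

Using `all.equal()` rather than `==` avoids false mismatches caused by floating-point rounding.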

Part 3

mad

## function (x, center = median(x), constant = 1.4826, na.rm = FALSE, 
##     low = FALSE, high = FALSE) 
## {
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     n <- length(x)
##     constant * if ((low || high) && n%%2 == 0) {
##         if (low && high) 
##             stop("'low' and 'high' cannot be both TRUE")
##         n2 <- n%/%2 + as.integer(high)
##         sort(abs(x - center), partial = n2)[n2]
##     }
##     else median(abs(x - center))
## }
## <bytecode: 0x14ebae410>
## <environment: namespace:stats>

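I think this command computes the “median absolute deviation” (MAD), because the last line of the code is the formula for calculating it: the median of the absolute deviations from the center (by default the median), which is then multiplied by the constant 1.4826.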

median(abs(x - center))
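
To see what this last line does, the default case of `mad()` can be reproduced by hand: take the median of the absolute deviations from the median and scale it by 1.4826. This is a sketch, with `mad_by_hand` as a hypothetical helper name and the vector copied from the printed output in Part 2.

```r
# Default behaviour of mad(): constant * median(|x - median(x)|)
mad_by_hand <- function(x, constant = 1.4826) {
  center <- median(x)
  constant * median(abs(x - center))
}

v <- c(43, 70, 87, 81, 56, 57, 45)

mad_by_hand(v)

# Should agree with the built-in function
all.equal(mad_by_hand(v), mad(v))
```

The constant 1.4826 rescales the result so that, for normally distributed data, it is comparable to the standard deviation.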

sd

## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x13f883aa8>
## <environment: namespace:stats>

I expected this function to be the standard deviation because we used it in the previous part of this assignment. Unpacking the function confirms this: it contains the following code, which takes the square root of the variance, and that is the formula for the standard deviation.

sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm))

** Now, I will construct my own function, which converts meters to feet.

meter_to_foot <- function(length_m){
  length_f <- (length_m * 3.28)
  return(length_f)
}

** Testing the constructed function

# The altitude of Mount Everest, meters -> feet
meter_to_foot(8850)

## [1] 29028

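The altitude of Mount Everest found on Google is 29,028 ft, which is the same as the result computed here. Note that 3.28 is a rounded conversion factor (1 m is about 3.28084 ft), so the agreement is approximate.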
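
Because the function above uses the rounded factor 3.28, a variant based on the exact definition of the foot (1 ft = 0.3048 m) gives a slightly different answer. The sketch below is an illustrative alternative, not part of the original assignment; `meter_to_foot_exact` is a name introduced here.

```r
# Hypothetical variant: divide by the exact length of one foot in meters
meter_to_foot_exact <- function(length_m) {
  stopifnot(is.numeric(length_m))  # reject non-numeric input early
  length_m / 0.3048
}

meter_to_foot_exact(8850)        # about 29035 ft
meter_to_foot_exact(c(1, 100))   # the function is vectorized
```

The gap of roughly 7 ft between this result and 29,028 comes entirely from rounding the conversion factor.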

Part 4

# Data selection
USArrests

##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6

library(ggplot2)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(moments)

# More detail on the data "Murder"
describe(USArrests$Murder)

##    vars  n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 50 7.79 4.36   7.25    7.53 5.41 0.8 17.4  16.6 0.37    -0.95 0.62

# Basic density
p <- ggplot(USArrests, aes(x=Murder)) + 
  geom_density(color="darkgreen", fill="lightgreen")

# Add mean line
p + geom_vline(aes(xintercept=mean(Murder)),
            color="black", linetype="dashed", linewidth=2)

Based on the density plot of Murder above, I think the data is positively skewed: there are some extreme values on the right-hand side of the distribution that pull its tail to the right.

** A package called “moments” lets us calculate skewness with a single command.

# Find the skewness using "moments" package
skewness(USArrests$Murder)

## [1] 0.3820378

Based on the coefficient from above, 0.38, we can say that the distribution is approximately symmetric because the skewness coefficient is between -0.5 and 0.5. The positive sign agrees with the right tail seen in the plot, but the magnitude is small. As a rule of thumb, if the skewness is less than -1 or greater than 1, the distribution is highly skewed. If the skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
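
The coefficient and the rule of thumb can be reproduced without extra packages. The sketch below assumes the moment-based definition of skewness (the third central moment divided by the second central moment raised to the power 3/2, both with divisor n), which is my understanding of what `skewness()` from “moments” computes. `skew_by_hand` and `classify_skew` are hypothetical helper names introduced for illustration.

```r
# Moment-based skewness: m3 / m2^(3/2), with divisor n for both moments
skew_by_hand <- function(x) {
  d <- x - mean(x)
  n <- length(x)
  (sum(d^3) / n) / (sum(d^2) / n)^(3/2)
}

# Rule of thumb from the text: |s| < 0.5, 0.5 to 1, and above 1
classify_skew <- function(s) {
  if (abs(s) > 1) {
    "highly skewed"
  } else if (abs(s) >= 0.5) {
    "moderately skewed"
  } else {
    "approximately symmetric"
  }
}

s <- skew_by_hand(USArrests$Murder)
s
classify_skew(s)
```

The `describe()` output earlier reports a skew of 0.37 rather than 0.38. This appears to be because psych applies a slightly different sample-size adjustment by default, so small differences between packages are expected.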

Pin Lyu

2023-09-15