Assignment-01-Statistical Software For Data Analysis-Data Science

Importing Library

Here I load the ‘dslabs’ package, which provides the dataset used in this assignment.

library(dslabs)

Loading Dataset

Loading the dataset named ‘murders’

data(murders)
murders

##                   state abb        region population total
## 1               Alabama  AL         South    4779736   135
## 2                Alaska  AK          West     710231    19
## 3               Arizona  AZ          West    6392017   232
## 4              Arkansas  AR         South    2915918    93
## 5            California  CA          West   37253956  1257
## 6              Colorado  CO          West    5029196    65
## 7           Connecticut  CT     Northeast    3574097    97
## 8              Delaware  DE         South     897934    38
## 9  District of Columbia  DC         South     601723    99
## 10              Florida  FL         South   19687653   669
## 11              Georgia  GA         South    9920000   376
## 12               Hawaii  HI          West    1360301     7
## 13                Idaho  ID          West    1567582    12
## 14             Illinois  IL North Central   12830632   364
## 15              Indiana  IN North Central    6483802   142
## 16                 Iowa  IA North Central    3046355    21
## 17               Kansas  KS North Central    2853118    63
## 18             Kentucky  KY         South    4339367   116
## 19            Louisiana  LA         South    4533372   351
## 20                Maine  ME     Northeast    1328361    11
## 21             Maryland  MD         South    5773552   293
## 22        Massachusetts  MA     Northeast    6547629   118
## 23             Michigan  MI North Central    9883640   413
## 24            Minnesota  MN North Central    5303925    53
## 25          Mississippi  MS         South    2967297   120
## 26             Missouri  MO North Central    5988927   321
## 27              Montana  MT          West     989415    12
## 28             Nebraska  NE North Central    1826341    32
## 29               Nevada  NV          West    2700551    84
## 30        New Hampshire  NH     Northeast    1316470     5
## 31           New Jersey  NJ     Northeast    8791894   246
## 32           New Mexico  NM          West    2059179    67
## 33             New York  NY     Northeast   19378102   517
## 34       North Carolina  NC         South    9535483   286
## 35         North Dakota  ND North Central     672591     4
## 36                 Ohio  OH North Central   11536504   310
## 37             Oklahoma  OK         South    3751351   111
## 38               Oregon  OR          West    3831074    36
## 39         Pennsylvania  PA     Northeast   12702379   457
## 40         Rhode Island  RI     Northeast    1052567    16
## 41       South Carolina  SC         South    4625364   207
## 42         South Dakota  SD North Central     814180     8
## 43            Tennessee  TN         South    6346105   219
## 44                Texas  TX         South   25145561   805
## 45                 Utah  UT          West    2763885    22
## 46              Vermont  VT     Northeast     625741     2
## 47             Virginia  VA         South    8001024   250
## 48           Washington  WA          West    6724540    93
## 49        West Virginia  WV         South    1852994    27
## 50            Wisconsin  WI North Central    5686986    97
## 51              Wyoming  WY          West     563626     5

Compute state names and store them in a vector

First, I extracted the state names from the murders dataset and stored them in a vector named state_names.

General form for creating a vector: \[ state_names <- c(-,-,-,-,-) \]

murders[,1]

##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"

state_names <- c(murders[,1])
print(state_names)

##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"
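The relationship between c() and column extraction can be illustrated with a minimal sketch. It uses a hypothetical three-row data frame named mini, copied from the first rows of the table above, and assumes nothing beyond base R. It suggests that selecting a single column already returns a vector, so wrapping the selection in c() is optional.

```r
# Hypothetical mini data frame built from the first three rows printed above.
mini <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  abb = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017),
  stringsAsFactors = FALSE
)

# c() combines individual values into a vector.
typed_names <- c("Alabama", "Alaska", "Arizona")

# Selecting one column by position or by name already returns a vector.
by_position <- mini[, 1]
by_name <- mini$state

# Both comparisons evaluate to TRUE for this data frame.
same_as_typed <- identical(typed_names, by_position)
same_both_ways <- identical(by_position, by_name)
print(c(same_as_typed, same_both_ways))
```

Selecting by name (mini$state) tends to be easier to read than selecting by position, because it does not depend on the column order.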

Store the Abbreviations in a Separate Vector

The same approach applies here: I stored the second column of murders, named abb (the state abbreviations), in a vector.

abbreviation <- c(murders[,2])
print(abbreviation)

##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN"
## [16] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
## [31] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VT" "VA" "WA" "WV" "WI" "WY"

What is the Total Population of USA?

The total population is the sum of the population figures for every state in the dataset. To add up a set of values in R, we use the sum() function.

population <- c(murders[,4])
total_population <- sum(population)
print(total_population)

## [1] 309864228

How many murders were committed according to this dataset?

The same approach applies here: summing the total column over all states gives the total number of murders recorded in the dataset.

kills <- c(murders[,5])
print(sum(kills))

## [1] 9403
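Both totals can also be computed in one step. The sketch below is only an illustration: it uses a hypothetical data frame named mini holding the first five rows of the table above, so its sums cover those five states rather than the whole dataset.

```r
# First five rows of the table printed earlier (illustrative subset only).
mini <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956),
  total = c(135, 19, 232, 93, 1257)
)

# colSums() adds up each selected numeric column and returns a named vector.
totals <- colSums(mini[, c("population", "total")])
print(totals)

# The same values via sum(); na.rm = TRUE would skip missing entries if any existed.
population_sum <- sum(mini$population, na.rm = TRUE)
murder_sum <- sum(mini$total, na.rm = TRUE)
```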

Which state has the highest population?

In programming and data analysis, a dataframe is a fundamental data structure used for organizing and storing data in a tabular format, similar to a spreadsheet or a database table. Dataframes are commonly used in languages like R.

\[data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = FALSE)\]

Description:

which.max()

Determines the location, i.e., the index, of the (first) minimum or maximum of a numeric (or logical) vector. \[which.max(x)\]

state_data <- data.frame(State = state_names, Population = population)
highest_population <- which.max(state_data$Population)
state_with_highest_population <- state_data$State[highest_population]
cat("The state with the highest population is:", state_with_highest_population)

## The state with the highest population is: California
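Because which.max() reports only the first position of the maximum, ties are silently dropped. The sketch below uses hypothetical counts (not taken from the dataset) to show one way of retrieving every tied position.

```r
# Hypothetical counts with a tie for the maximum (not from the murders data).
counts <- c(a = 5, b = 12, c = 12, d = 2)

# which.max() returns the position of the first maximum only.
first_max <- which.max(counts)

# Comparing against max() and wrapping in which() returns every tied position.
all_max <- which(counts == max(counts))

# which.min() works the same way for the minimum.
first_min <- which.min(counts)

print(first_max)
print(all_max)
print(first_min)
```

In the murders data the maximum and minimum happen to be unique, so which.max() and which.min() are sufficient there.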

Which state has the highest number of murders?

\[which.max(x)\]

Concatenate:

\[ cat(… , file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE) \]

I did the same here as before: I created a new dataframe with the necessary data, used which.max() to find the index of the largest murder count, and then used that index to look up the state name.

state_data_1 <- data.frame(State = state_names, Murder=kills)
highest_murders <- which.max(state_data_1$Murder)
state_with_highest_murders <- state_data_1$State[highest_murders]
cat("The state with the highest murders is:", state_with_highest_murders)

## The state with the highest murders is: California
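A few details of cat() are worth seeing in isolation. The sketch below uses an illustrative value for state. It shows that cat() does not add a trailing newline by itself, that sep controls the separator, and that paste() can be used when the message is needed as a value rather than as printed output.

```r
state <- "California"  # illustrative value

# cat() prints its arguments separated by a space; "\n" ends the line explicitly.
cat("The state with the highest population is:", state, "\n")

# sep changes the separator placed between the arguments.
cat("A", "B", "C", sep = "-")
cat("\n")

# paste() builds the same kind of string but returns it instead of printing it.
msg <- paste("The state with the highest population is:", state)
print(msg)
```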

Which state has the lowest number of murders?

\[which.min(x)\]

state_data_1 <- data.frame(State = state_names, Murder=kills)
lowest_murders <- which.min(state_data_1$Murder)
state_with_lowest_murders <- state_data_1$State[lowest_murders]
cat("The state with the lowest murders is:", state_with_lowest_murders)

## The state with the lowest murders is: Vermont

Compute correlation between population and murders

Correlation is a statistical measure that quantifies the degree to which two or more variables are related or associated. It assesses whether there is a statistical relationship between the variables and the strength and direction of that relationship. Correlation is used to describe how changes in one variable are related to changes in another variable.

\[cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))\]

\[ r = \frac{n \sum{(xy)} - \sum{x} \sum{y}}{\sqrt{[n \sum{(x^2)} - (\sum{x})^2][n \sum{(y^2)} - (\sum{y})^2]}} \]
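The computational formula above is algebraically the same as the deviation form of Pearson's r. A short derivation, writing \bar{x} and \bar{y} for the sample means:

```latex
\[
\begin{aligned}
n \sum xy - \sum x \sum y &= n \sum (x - \bar{x})(y - \bar{y}), \\
n \sum x^2 - \Bigl(\sum x\Bigr)^2 &= n \sum (x - \bar{x})^2, \\
n \sum y^2 - \Bigl(\sum y\Bigr)^2 &= n \sum (y - \bar{y})^2, \\
r &= \frac{\sum (x - \bar{x})(y - \bar{y})}
          {\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}.
\end{aligned}
\]
```

The factor n in the numerator cancels against the \sqrt{n \cdot n} in the denominator, which is why both forms give the same value.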

A correlation coefficient of 0.9635956 indicates a very strong positive linear relationship between population and the number of murders: states with larger populations tend to record more murders.

correlation <- cor(population, kills)
cat("Correlation between population and murders:", correlation)

## Correlation between population and murders: 0.9635956

# Another method: apply the computational formula for r directly
x <- population
y <- kills
n <- length(x)
sumxy <- sum(x * y)
sumx <- sum(x)
sumy <- sum(y)
sumofxsqr <- sum(x^2)
sumofysqr <- sum(y^2)
# Numerator and denominator of the formula
nom <- (n * sumxy) - (sumx * sumy)
den <- (((n * sumofxsqr) - sumx^2) * ((n * sumofysqr) - sumy^2))^0.5
r <- nom / den
cat("\nCorrelation between Population and Murders from Another method:", r)

## 
## Correlation between Population and Murders from Another method: 0.9635956
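The agreement between the two methods can be checked programmatically rather than by eye. The sketch below wraps the computational formula in a hypothetical helper named pearson_manual and compares it with cor() on the first six rows of the table, so the value it produces is for that subset only and will differ from 0.9635956.

```r
# First six rows of the murders table (illustrative subset).
population <- c(4779736, 710231, 6392017, 2915918, 37253956, 5029196)
kills <- c(135, 19, 232, 93, 1257, 65)

# Hypothetical helper implementing the computational formula for Pearson's r.
pearson_manual <- function(x, y) {
  n <- length(x)
  numerator <- n * sum(x * y) - sum(x) * sum(y)
  denominator <- sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
  numerator / denominator
}

r_manual <- pearson_manual(population, kills)
r_builtin <- cor(population, kills)

# all.equal() compares with a small numerical tolerance instead of exact equality.
methods_agree <- isTRUE(all.equal(r_manual, r_builtin))
print(methods_agree)
```

Comparing with all.equal() rather than == avoids false mismatches caused by floating-point rounding.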

Fit a regression line of population on number of murders

The mean, often referred to as the “average,” is a fundamental concept in statistics and mathematics. It is a measure of central tendency that provides a way to summarize a set of data points into a single representative value. The mean is calculated by summing all the values in a dataset and then dividing that sum by the total number of data points.

\[ \bar{x} = \frac{1}{n} \sum{x} \]

The linear regression line, often referred to as the “regression line” or “best-fit line,” is a key component of linear regression analysis. It represents the linear relationship between a dependent variable (‘y’) and one or more independent variables (‘x’) by defining a straight-line equation that provides the best fit to the observed data points.

In the context of simple linear regression (where there is only one independent variable), the equation of the linear regression line is typically represented as:

\[ y = a + b \cdot x \]

In this equation:

y represents the dependent variable (the variable to be predicted or explained).
x represents the independent variable (the variable used for prediction or explanation).
a is the intercept of the line, which represents the predicted value of ‘y’ when ‘x’ is equal to zero.
b is the slope of the line, which represents the rate of change in ‘y’ for a one-unit change in ‘x’.

The linear regression line is determined through a process that minimizes the sum of squared differences between the observed data points and the values predicted by the line. This line is chosen because it provides the best linear approximation of the relationship between ‘x’ and ‘y’ in the least-squares sense.

The linear regression line is a tool for making predictions and understanding how changes in the independent variable(s) relate to changes in the dependent variable. It provides a mathematical model that quantifies this relationship in a linear form.

Here ‘y’ is the number of murders and ‘x’ is the population. The least-squares estimates of the slope and the intercept (stored as b1 and b0 in the code below) are:

\[ b = \frac{{\sum(x - \bar{x})(y - \bar{y})}}{{\sum(x - \bar{x})^2}} \]

\[ a = \bar{y} - b \cdot \bar{x} \]

mean_population <- sum(population) / length(population)
mean_murders <- sum(kills) / length(kills)
b1 <- sum((population - mean_population) * (kills - mean_murders)) / sum((population - mean_population)^2)
b0 <- mean_murders - b1 * mean_population

# Create a formula for the regression line
# (signif() keeps three significant digits, so the small slope is not rounded to 0)
formula_regression <- paste("Number_of_Murders =", round(b0, 2), "+", signif(b1, 3), "* Population")
# Print the formula
cat("Regression Line:", formula_regression)

## Regression Line: Number_of_Murders = -17.13 + 3.32e-05 * Population
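The hand-computed intercept and slope can be cross-checked against R's built-in lm() function. The sketch below is an illustration on the first six rows of the table only, so its coefficients will differ from those for the full dataset. The variable names fit and coefficients_agree are introduced here for the example.

```r
# First six rows of the murders table (illustrative subset).
population <- c(4779736, 710231, 6392017, 2915918, 37253956, 5029196)
kills <- c(135, 19, 232, 93, 1257, 65)

# Least-squares slope and intercept from the closed-form formulas.
b1 <- sum((population - mean(population)) * (kills - mean(kills))) /
  sum((population - mean(population))^2)
b0 <- mean(kills) - b1 * mean(population)

# lm() fits the same line: murders as a linear function of population.
fit <- lm(kills ~ population)
print(coef(fit))

# The two approaches should agree up to floating-point tolerance.
coefficients_agree <- isTRUE(all.equal(unname(coef(fit)), c(b0, b1)))
print(coefficients_agree)
```

Using lm() also gives access to summary(fit), which reports standard errors and R-squared alongside the coefficients.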

Display the population of Washington

Here, I first extracted the population of the row whose state name is “Washington”, stored it in washington_population, and then printed it using the cat() function.

washington_population <- state_data$Population[state_data$State == "Washington"]
cat("The population of Washington is:", washington_population)

## The population of Washington is: 6724540

Display the number of murders in Alaska

The same approach applies here: I extracted the number of murders for the state named Alaska, stored it in alaska_murders, and then printed it.

alaska_murders <- state_data_1$Murder[state_data_1$State=="Alaska"]
cat("Murders in Alaska are:",alaska_murders)

## Murders in Alaska are: 19
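The lookups above rely on logical indexing. The sketch below shows the same idea next to two alternatives, match() and a named vector, on a hypothetical three-row data frame named mini whose rows are copied from the table. It is one possible way to write such lookups, not the only one.

```r
# Three rows copied from the murders table (illustrative subset).
mini <- data.frame(
  state = c("Alaska", "Washington", "West Virginia"),
  population = c(710231, 6724540, 1852994),
  total = c(19, 93, 27),
  stringsAsFactors = FALSE
)

# Logical indexing: keep the entries where the condition is TRUE.
washington_population <- mini$population[mini$state == "Washington"]

# match() returns the position of the first exact match.
alaska_murders <- mini$total[match("Alaska", mini$state)]

# A named vector allows lookup by state name directly.
population_by_state <- setNames(mini$population, mini$state)
washington_again <- population_by_state[["Washington"]]

cat("Washington population:", washington_population, "\n")
cat("Alaska murders:", alaska_murders, "\n")
```

Note that == compares whole strings, so "Washington" does not accidentally match "West Virginia" or any other state.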