First, I load the library named ‘dslabs’.
library(dslabs)
Next, I load the dataset named ‘murders’ and print it.
data(murders)
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
First, I extract the state names (the first column) from the murders dataset and store them in a vector named state_names.
General form for creating a vector with c(): \[ \text{state\_names} \leftarrow c(x_1, x_2, \ldots, x_n) \]
murders[,1]
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
state_names <- c(murders[,1])
print(state_names)
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
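As a small aside, a column can be pulled out of a data frame by position, by name with `$`, or by name with `[[ ]]`. The sketch below assumes a hand-typed copy of the first three rows of the table above (the name `murders_small` is made up for illustration) so that it runs without the dslabs package.

```r
# Hypothetical mini copy of the first three rows of the murders table
murders_small <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  abb        = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017),
  total      = c(135, 19, 232),
  stringsAsFactors = FALSE
)

by_position <- murders_small[, 1]         # first column, by position
by_name     <- murders_small$state        # same column, by name
by_bracket  <- murders_small[["state"]]   # same column, by name with [[ ]]

# All three give the same character vector, so c() around them is optional
print(identical(by_position, by_name))
print(identical(by_name, by_bracket))
print(is.vector(by_name))
```

Accessing by name is usually easier to read than `[, 1]`, because it does not depend on the column order.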
In the same way, I store the second column of murders, named abb (the state abbreviations), in a vector.
abbreviation <- c(murders[,2])
print(abbreviation)
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN"
## [16] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
## [31] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VT" "VA" "WA" "WV" "WI" "WY"
The total population is the sum of the population figures of every state in the dataset. To add up a set of values, we use the sum() function.
population <- c(murders[,4])
total_population <- sum(population)
print(total_population)
## [1] 309864228
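A minimal sketch of how sum() behaves, assuming only the first three population values from the table above. The vector `with_gap` is a made-up example to show what happens when a value is missing; the real murders data has no missing values in this column.

```r
# First three population values from the murders table
population_small <- c(4779736, 710231, 6392017)
total_small <- sum(population_small)
print(total_small)

# Hypothetical vector with one missing value
with_gap <- c(population_small, NA)
print(sum(with_gap))                 # NA, because one value is missing
print(sum(with_gap, na.rm = TRUE))   # missing value is dropped first
```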
In the same way, I sum the murder counts of all states to get the total number of murders in the dataset.
kills <- c(murders[,5])
print(sum(kills))
## [1] 9403
In programming and data analysis, a dataframe is a fundamental data structure used for organizing and storing data in a tabular format, similar to a spreadsheet or a database table. Dataframes are commonly used in languages like R, where they are created with the data.frame() function:
\[data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = FALSE)\]
Description:
which.max()
Determines the location, i.e., index of the (first) minimum or maximum of a numeric (or logical) vector.\[which.max(x)\]
state_data <- data.frame(State = state_names, Population = population)
highest_population <- which.max(state_data$Population)
state_with_highest_population <- state_data$State[highest_population]
cat("The state with the highest population is:", state_with_highest_population)
## The state with the highest population is: California
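The same lookup can be sketched without building a data frame first, since which.max() only needs the numeric vector. The values below are the first five rows of the table above, retyped by hand; the names `state_small` and `population_small` are made up for this illustration.

```r
# First five states and populations from the murders table
state_small      <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California")
population_small <- c(4779736, 710231, 6392017, 2915918, 37253956)

# which.max() returns the position of the largest value
idx <- which.max(population_small)
print(idx)

# Use that position to index the vector of names
largest_state <- state_small[idx]
cat("Largest of these five states:", largest_state, "\n")
```

This works because both vectors are in the same order, so position `idx` refers to the same state in each.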
\[which.max(x)\]
Concatenate:
\[ cat(… , file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE) \]
I do the same here as before: I create a new dataframe with the necessary data, find the index of the maximum number of murders, and then use that index to look up the state.
state_data_1 <- data.frame(State = state_names, Murder=kills)
highest_murders <- which.max(state_data_1$Murder)
state_with_highest_murders <- state_data_1$State[highest_murders]
cat("The state with the highest murders is:", state_with_highest_murders)
## The state with the highest murders is: California
\[which.min(x)\]
state_data_1 <- data.frame(State = state_names, Murder=kills)
lowest_murders <- which.min(state_data_1$Murder)
state_with_lowest_murders <- state_data_1$State[lowest_murders]
cat("The state with the lowest murders is:", state_with_lowest_murders)
## The state with the lowest murders is: Vermont
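One thing to keep in mind is that which.min() and which.max() return only the first position when several values tie. The sketch below uses a made-up vector of counts (not taken from the murders data, where the minimum is unique) to show one way of getting every tied position.

```r
# Hypothetical counts with a tie for the minimum
counts <- c(5, 2, 7, 2)

first_min <- which.min(counts)            # only the first minimum
all_mins  <- which(counts == min(counts)) # every position equal to the minimum

print(first_min)
print(all_mins)
```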
Correlation is a statistical measure that quantifies the degree to which two or more variables are related or associated. It assesses whether there is a statistical relationship between the variables and the strength and direction of that relationship. Correlation is used to describe how changes in one variable are related to changes in another variable. \[cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))\]
\[ r = \frac{n \sum{(xy)} - \sum{x} \sum{y}}{\sqrt{[n \sum{(x^2)} - (\sum{x})^2][n \sum{(y^2)} - (\sum{y})^2]}} \]
A correlation coefficient of 0.9635956 indicates a very strong positive linear relationship between two variables.
correlation <- cor(population, kills)
cat("Correlation between population and murders:", correlation)
## Correlation between population and murders: 0.9635956
# Another method: compute r directly from the formula above
x <- population
y <- kills
n <- length(x)
sumxy <- sum(x * y)
sumx <- sum(x)
sumy <- sum(y)
sumofxsqr <- sum(x^2)
sumofysqr <- sum(y^2)
nom <- (n * sumxy) - (sumx * sumy)
den <- (((n * sumofxsqr) - sumx^2) * ((n * sumofysqr) - sumy^2))^0.5
r <- nom / den
cat("\nCorrelation between Population and Murders from Another method:", r)
##
## Correlation between Population and Murders from Another method: 0.9635956
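As a quick cross-check, the manual formula can be compared with cor() and with the equivalent covariance form, cov(x, y) / (sd(x) * sd(y)). This is only a sketch on the first five rows of the table above, so the value of r here is not the 0.9635956 obtained from all 51 rows.

```r
# First five rows of the murders table (population and total murders)
x <- c(4779736, 710231, 6392017, 2915918, 37253956)
y <- c(135, 19, 232, 93, 1257)
n <- length(x)

# Pearson's r from the sums formula
r_manual <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

# Pearson's r from covariance and standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

# Both should agree with the built-in function up to rounding error
print(all.equal(r_manual, cor(x, y)))
print(all.equal(r_cov, cor(x, y)))
```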
The mean, often referred to as the “average,” is a fundamental concept in statistics and mathematics. It is a measure of central tendency that provides a way to summarize a set of data points into a single representative value. The mean is calculated by summing all the values in a dataset and then dividing that sum by the total number of data points.
\[ \bar{x} = \frac{1}{n} \sum{x} \]
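The formula above can be checked against R's built-in mean() function. A small sketch, assuming the first five murder counts from the table (the name `kills_small` is made up for illustration):

```r
# First five murder totals from the murders table
kills_small <- c(135, 19, 232, 93, 1257)

mean_by_hand  <- sum(kills_small) / length(kills_small)  # (1/n) * sum of x
mean_built_in <- mean(kills_small)

print(mean_by_hand)
print(all.equal(mean_by_hand, mean_built_in))
```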
The linear regression line, often referred to as the “regression line” or “best-fit line,” is a key component of linear regression analysis. It represents the linear relationship between a dependent variable (‘y’) and one or more independent variables (‘x’) by defining a straight-line equation that provides the best fit to the observed data points.
In the context of simple linear regression (where there is only one independent variable), the equation of the linear regression line is typically represented as \(y = a + b x\).
In this equation:
y represents the dependent variable (the variable to be predicted or explained).
x represents the independent variable (the variable used for prediction or explanation).
a is the intercept of the line, which represents the predicted value of ‘y’ when ‘x’ is equal to zero.
b is the slope of the line, which represents the rate of change in ‘y’ for a one-unit change in ‘x.’
The linear regression line is determined through a process that minimizes the sum of squared differences between the observed data points and the values predicted by the line. This line is chosen because it provides the best linear approximation of the relationship between ‘x’ and ‘y’ in the least-squares sense.
The linear regression line is a tool for making predictions and understanding how changes in the independent variable(s) relate to changes in the dependent variable. It provides a mathematical model that quantifies this relationship in a linear form.
\[ y = a + b \cdot x \]
\[ b = b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \]
\[ a = b_0 = \bar{y} - b_1 \cdot \bar{x} \]
mean_population <- sum(population) / length(population)
mean_murders <- sum(kills) / length(kills)
b1 <- sum((population - mean_population) * (kills - mean_murders)) / sum((population - mean_population)^2)
b0 <- mean_murders - b1 * mean_population
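# Create a formula for the regression line; the slope is very small (murders per person),
# so keep 3 significant digits instead of rounding to 2 decimals, which would print 0
formula_regression <- paste("Number_of_Murders =", round(b0, 2), "+", signif(b1, 3), "* Population")
# Print the formula
cat("Regression Line:", formula_regression)
## Regression Line: Number_of_Murders = -17.13 + 3.32e-05 * Population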
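The hand-computed slope and intercept can be cross-checked against R's lm() function, which fits the same least-squares line. The sketch below assumes only the first five rows of the table above (in a made-up data frame `murders_small`), so its coefficients differ from the ones for the full dataset.

```r
# Hypothetical mini copy of the first five rows of the murders table
murders_small <- data.frame(
  population = c(4779736, 710231, 6392017, 2915918, 37253956),
  total      = c(135, 19, 232, 93, 1257)
)
x <- murders_small$population
y <- murders_small$total

# Slope and intercept from the least-squares formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Same line fitted by lm(); coef() returns intercept first, then slope
fit <- lm(total ~ population, data = murders_small)
print(coef(fit))
print(all.equal(unname(coef(fit)), c(b0, b1)))
```

Printing the slope with signif() or in scientific notation keeps it visible; rounded to two decimals, a slope of this size shows up as 0.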
Here, I first extract the population of the state named “Washington”, store it in washington_population, and then print it using the cat() (concatenate) function.
washington_population <- state_data$Population[state_data$State == "Washington"]
cat("The population of Washington is:", washington_population)
## The population of Washington is: 6724540
In the same way, I extract the number of murders in the state named Alaska, store it in alaska_murders, and then print it.
alaska_murders <- state_data_1$Murder[state_data_1$State=="Alaska"]
cat("Murders in Alaska are:",alaska_murders)
## Murders in Alaska are: 19
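Both lookups above rely on logical indexing: the comparison with `==` produces a TRUE/FALSE vector, and only the positions marked TRUE are kept. A minimal sketch of that mechanism, assuming the first three rows of the table (the names `state_small`, `murder_small` and `is_alaska` are made up for illustration):

```r
# First three states and murder totals from the murders table
state_small  <- c("Alabama", "Alaska", "Arizona")
murder_small <- c(135, 19, 232)

# The comparison gives one TRUE/FALSE value per state
is_alaska <- state_small == "Alaska"
print(is_alaska)

# Indexing with that logical vector keeps only the TRUE positions
alaska_small <- murder_small[is_alaska]
cat("Murders in Alaska are:", alaska_small, "\n")
```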