We focus on USArrests data. Official documentation states that the dataset contains 50 rows for every US state and the data from the year 1973 for:
This doesn’t quite satisfy the condition of the task. We need a categorical variable as well. It is well known that states dataset contains demographic data for US states in R. We need to make sure that the data we add was relevant in 1973. For example, it makes no sense to compare the income data from 1985 with the crime data from the year of 1973. The data are normalized per 100.000 inhabitants, which makes them comparable.
library(datasets)
library(car)
## Loading required package: carData
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
library(readxl)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
dataset <- datasets::USArrests #this is the dataset we're going to work with
rownames(dataset) #we see that the state is actually a row name, not a column. Let's add it.
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
dataset$State<-rownames(dataset) #...and let's remove the rownames, function cbind(...)
rownames(dataset) = NULL #let us add the region the state is in. This information we can derive from the state.region dataset
temp <-data.frame(State = datasets::state.name, Region = datasets::state.region) #Now we can merge the data frames because we have the State column
merged <- merge(dataset, temp, by = "State")
head(merged) # prints first 6 rows
## State Murder Assault UrbanPop Rape Region
## 1 Alabama 13.2 236 58 21.2 South
## 2 Alaska 10.0 263 48 44.5 West
## 3 Arizona 8.1 294 80 31.0 West
## 4 Arkansas 8.8 190 50 19.5 South
## 5 California 9.0 276 91 40.6 West
## 6 Colorado 7.9 204 78 38.7 West
#Note that region is a factor with 4 levels, not a character column.
# Rename the column UrbanPop to Urban
merged$Urban <- merged$UrbanPop
merged$UrbanPop <- NULL
# Create new columns
merged$TotalArr <- merged$Murder + merged$Assault + merged$Rape
merged$TAUR <- merged$TotalArr / merged$Urban
merged$M2 <- merged$Murder * merged$Murder
# Display the first few rows of the dataset
print(head(merged))
## State Murder Assault Rape Region Urban TotalArr TAUR M2
## 1 Alabama 13.2 236 21.2 South 58 270.4 4.662069 174.24
## 2 Alaska 10.0 263 44.5 West 48 317.5 6.614583 100.00
## 3 Arizona 8.1 294 31.0 West 80 333.1 4.163750 65.61
## 4 Arkansas 8.8 190 19.5 South 50 218.3 4.366000 77.44
## 5 California 9.0 276 40.6 West 91 325.6 3.578022 81.00
## 6 Colorado 7.9 204 38.7 West 78 250.6 3.212821 62.41
# Create Safest dataset
Safest <- merged[order(merged$TotalArr), ]
Safest <- Safest[1:3, ]
print(head(Safest))
## State Murder Assault Rape Region Urban TotalArr TAUR M2
## 34 North Dakota 0.8 45 7.3 North Central 44 53.1 1.206818 0.64
## 45 Vermont 2.2 48 11.2 Northeast 32 61.4 1.918750 4.84
## 49 Wisconsin 2.6 53 10.8 North Central 66 66.4 1.006061 6.76
# Create Dangerous dataset
Dangerous <- merged[order(-merged$TotalArr), ]
Dangerous <- Dangerous[1:3, ]
print(head(Dangerous))
## State Murder Assault Rape Region Urban TotalArr TAUR M2
## 9 Florida 15.4 335 31.9 South 80 382.3 4.778750 237.16
## 33 North Carolina 13.0 337 16.1 South 45 366.1 8.135556 169.00
## 20 Maryland 11.3 300 27.8 South 67 339.1 5.061194 127.69
# Remove M2 column
merged$M2 <- NULL
print(head(merged))
## State Murder Assault Rape Region Urban TotalArr TAUR
## 1 Alabama 13.2 236 21.2 South 58 270.4 4.662069
## 2 Alaska 10.0 263 44.5 West 48 317.5 6.614583
## 3 Arizona 8.1 294 31.0 West 80 333.1 4.163750
## 4 Arkansas 8.8 190 19.5 South 50 218.3 4.366000
## 5 California 9.0 276 40.6 West 91 325.6 3.578022
## 6 Colorado 7.9 204 38.7 West 78 250.6 3.212821
Regarding Summary statistics, we can use a base function summary for the basic summary statistics. However, if we need a more detailed approach, the package psych has a neat function describe which encompases the (usual) descriptive statistics in the one function.
num <- merged[, names(merged) != "Region"] # Remove categorical variables, pt 1
num <- num[, names(num) != "State"] # Remove categorical variables, pt 2
descStat <-describe(num)
print(descStat)
## vars n mean sd median trimmed mad min max range skew
## Murder 1 50 7.79 4.36 7.25 7.53 5.41 0.80 17.40 16.60 0.37
## Assault 2 50 170.76 83.34 159.00 168.48 110.45 45.00 337.00 292.00 0.22
## Rape 3 50 21.23 9.37 20.10 20.36 8.60 7.30 46.00 38.70 0.75
## Urban 4 50 65.54 14.47 66.00 65.88 17.79 32.00 91.00 59.00 -0.21
## TotalArr 5 50 199.78 93.37 185.20 198.02 116.46 53.10 382.30 329.20 0.17
## TAUR 6 50 3.12 1.59 2.75 2.93 1.32 0.86 8.14 7.27 1.05
## kurtosis se
## Murder -0.95 0.62
## Assault -1.15 11.79
## Rape 0.08 1.32
## Urban -0.87 2.05
## TotalArr -1.24 13.20
## TAUR 0.86 0.23
There’s no missing data. Assaults contribute the most to the total arrest numbers, but there seem to be no outliers, since means and medians are very close, and since trimmed mean (at 0.1 level) and mean are close. The TAUR data are the most skewed to the left, while from the kurtosis coefficient, we conclude that the rapes are the closest to the normal bell shape and the data aren’t “flattened” or “elongated”.
merged1 <- merged
merged1$Abbr <- state.abb # Adding the State Abbreviations for easier readability of plots
head(merged1)
## State Murder Assault Rape Region Urban TotalArr TAUR Abbr
## 1 Alabama 13.2 236 21.2 South 58 270.4 4.662069 AL
## 2 Alaska 10.0 263 44.5 West 48 317.5 6.614583 AK
## 3 Arizona 8.1 294 31.0 West 80 333.1 4.163750 AZ
## 4 Arkansas 8.8 190 19.5 South 50 218.3 4.366000 AR
## 5 California 9.0 276 40.6 West 91 325.6 3.578022 CA
## 6 Colorado 7.9 204 38.7 West 78 250.6 3.212821 CO
# List of numeric variables to plot: "Murder" "Assault" "Rape" "Urban" "TotalArr" "TAUR"
# Create individual scatterplots, we need to rotate the labels on the x-axis by 90 degrees.
p1 <- ggplot(merged1, aes(x = Abbr, y = Murder, color = Region)) +
geom_point() +
labs(x = "", y = "Murder", title = "Plot of Murder by State") +
theme(axis.text.x = element_text(angle = 90))
print(p1)
p2 <- ggplot(merged1, aes(x = Abbr, y = Assault, color = Region)) +
geom_point() +
labs(x = "", y = "Assault", title = "Plot of Assault by State") +
theme(axis.text.x = element_text(angle = 90))
print(p2)
p3 <- ggplot(merged1, aes(x = Abbr, y = Rape, color = Region)) +
geom_point() +
labs(x = "", y = "Rape", title = "Plot of Rape by State") +
theme(axis.text.x = element_text(angle = 90))
print(p3)
p4 <- ggplot(merged1, aes(x = Abbr, y = Urban, color = Region)) +
geom_point() +
labs(x = "", y = "Urban", title = "Plot of Urban by State") +
theme(axis.text.x = element_text(angle = 90))
print(p4)
p5 <- ggplot(merged1, aes(x = Abbr, y = TotalArr, color = Region)) +
geom_point() +
labs(x = "", y = "Total Arrests", title = "Plot of Total Arrests by State") +
theme(axis.text.x = element_text(angle = 90))
print(p5)
p6 <- ggplot(merged1, aes(x = Abbr, y = TAUR, color = Region)) +
geom_point() +
labs(x = "", y = "TAUR", title = "Plot of TAUR by State") +
theme(axis.text.x = element_text(angle = 90))
print(p6)
# We generate boxplots per Region
p1 <- ggplot(merged1, aes(x = Region, y = Murder, fill = Region)) +
geom_boxplot() +
labs(x = "", y = "Murder", title = "Boxplot of Murder by Region")
print(p1)
p2 <- ggplot(merged1, aes(x = Region, y = Assault, fill = Region)) +
geom_boxplot() +
labs(x = "", y = "Assault", title = "Boxplot of Assault by Region")
print(p2)
p3 <- ggplot(merged1, aes(x = Region, y = Rape, fill = Region)) +
geom_boxplot() +
labs(x = "", y = "Rape", title = "Boxplot of Rape by Region")
print(p3)
p4 <- ggplot(merged1, aes(x = Region, y = Urban, fill = Region)) +
geom_boxplot() +
labs(x = "", y = "Urban", title = "Boxplot of Urban by Region")
print(p4)
p5 <- ggplot(merged1, aes(x = Region, y = TotalArr, fill = Region)) +
geom_boxplot() +
labs(x = "", y = "Total Arrests", title = "Boxplot of Total Arrests by Region")
print(p5)
p6 <- ggplot(merged1, aes(x = Region, y = TAUR, fill = Region)) +
geom_boxplot() +
labs(x = "", y = "TAUR", title = "Boxplot of TAUR by Region")
print(p6)
# Function to calculate binwidth using Diaconis-Freedman rule
FDR <- function(x) {
2 * IQR(x) / (length(x)^(1/3))
}
# Create individual histograms for each numeric variable
p1 <- ggplot(merged1, aes(x = Murder)) +
geom_histogram(binwidth = FDR(merged1[,2])) +
labs(x = "Murder", y = "Frequency", title = "Histogram of Murder")
print(p1)
p2 <- ggplot(merged1, aes(x = Assault)) +
geom_histogram(binwidth = FDR(merged1[,3])) +
labs(x = "Assault", y = "Frequency", title = "Histogram of Assault")
print(p2)
p3 <- ggplot(merged1, aes(x = Rape)) +
geom_histogram(binwidth = FDR(merged1[,4])) +
labs(x = "Rape", y = "Frequency", title = "Histogram of Rape")
print(p3)
p4 <- ggplot(merged1, aes(x = Urban)) +
geom_histogram(binwidth = FDR(merged1[,6])) +
labs(x = "Urban", y = "Frequency", title = "Histogram of Urban")
print(p4)
p5 <- ggplot(merged1, aes(x = TotalArr)) +
geom_histogram(binwidth = FDR(merged1[,7])) +
labs(x = "TotalArr", y = "Frequency", title = "Histogram of Total Arrests")
print(p5)
p6 <- ggplot(merged1, aes(x = TAUR)) +
geom_histogram(binwidth = FDR(merged1[,8])) +
labs(x = "TAUR", y = "Frequency", title = "Histogram of TAUR")
print(p6)
# Create a data frame with the count of states per region
reg<-table(merged1$Region)
region_counts <- as.data.frame(reg)
colnames(region_counts) <- c("Region", "Count")
# Plot the distribution of the number of states per region
p <- ggplot(region_counts, aes(x = Region, y = Count, fill = Region)) +
geom_bar(stat = "identity") +
labs(x = "Region", y = "Number of States", title = "Distribution of Number of States per Region")
print(p)
In the first part, we have created scatterplots, from which we can conclude that southern states dominate the number of assaults. Some western states have a large number of assaults as well. Southern states have the largest number of murders as well. On the other hand, western states dominate the rape scatterplot. Regarding urban development, one can see that the northeastern states are the most developed ones and have the smallest number of arrests. The connection between the two may be speculated to originate from the fact the police are overwhelmed in large cities and it’s generally easier to make an arrest in the rural area, conditional on finding the suspect’s whereabouts. Total arrests scatterplot shows us southern and western states dominate the plot. Regarding the TAUR statistic we invented, we can conclude southern states have the largest TAUR, while the north-central states have the lowest.
From the boxplots, we conclude there are no outliers in the number of murders. The number of assaults varies greatly in the west. North central has a low number of assaults in general, except the states of Illinois and Michigan, which have the number of assaults in the 250s range. The number of rapes varies the most in the west, being largest in Nevada, possibly because of Las Vegas, while being lowest in Idaho, which is the most rural western state. The urban status varies greatly in the northeast, being one of the highest for New Jersey, and being the lowest in the States for Vermont. Regarding the north central states, they are usually urban, except for the Dakotas. Regarding total arrests, the west has the largest range of data, being the highest for Arizona, which dominates the number of assaults, and the smallest for Hawaii, which is an island state full of tourists who don’t generally commit violent crimes. Regarding north central region, Michigan bears the highest arrest rate and the highest assault rates. TAUR has a few outliers. In the northeast, New York is the outlier, since it has the largest assault numbers, and hence the largest total arrests, while Urban is in the 80s range. New Hampshire is another outlier, bearing a very low number of total arrests and the urban development index in the 50s range. In the south, North Carolina has the largest TAUR, since it has the low urban development index and a large number of assaults. In the west, Alaska is the outlier, since it has the lowest urban development and a large assault number. This index can be useful for the federal government to decide on police force distribution in rural states with high assault rates.
The histograms tell us a bit more about the shape of the density, since the histogram serves as a density estimator. We need to specify the number of bins for the data, since the default is 30, which does not suit our needs. We use Freedman - Diaconis rule. We conclude right tail of murder data is heavy and there’s not a large number of states with low murder rates. The mode is somewhere in 5-10 range, but we conclude there’s a large number of the states having the large number of murders (10-15). Density of the assaults is monotonously decreasing, with a heavy left tail, which is expected. Histogram of rape is more or less bell-shaped, however right tail is long because Alaska, California and Nevada have rape arrests larger than 40. Urban development histogram has a heavier right tail, indicating the country is more urban than rural in general, which can be attributed to the economic development. Histogram of total arrests is flat, which is expected on a national level. Histogram of TAUR is bell-shaped with Alaska, Mississippi and Carolinas contributing to the longer right tail, since they have TAUR larger than 6. Regarding the state numbers, we see the number of southern states is the largest.The data per region seem to be well balanced.
We need to load the dataset.
# Load the dataset and name it task2
task2 <-read_excel("C:/R/Task 2/Business School.xlsx")
colnames(task2) <- gsub(" ", "_", colnames(task2)) # Spaces in column names aren't needed
Now do the same thing we did in the task above, but for the average salary.
# Create a data frame with the count of entries per Undergrad_Degree
und <- table(task2$Undergrad_Degree)
undergrad_counts <- as.data.frame(und)
colnames(undergrad_counts) <- c("Undergrad_Degree", "Count")
# Plot the distribution of the number of entries per Undergrad_Degree
p <- ggplot(undergrad_counts, aes(x = Undergrad_Degree, y = Count)) +
geom_bar(stat = "identity") +
labs(x = "Undergrad Degree", y = "Number of Entries", title = "Distribution of Number of Entries per Undergrad Degree")
print(p)
The largest number of graduates is in Business and that’s a mode of the data.
# Calculate descriptive statistics for Annual Salary
desc <- describe(task2$Annual_Salary)
rownames(desc) = "Annual Salary"
print(desc)
## vars n mean sd median trimmed mad min max
## Annual Salary 1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000
## range skew kurtosis se
## Annual Salary 320000 2.22 9.41 4150.15
# Transform Annual Salary to thousands
task2$AST <- task2$Annual_Salary / 1000
binwidth_fdr <- FDR(task2$AST)
# Histogram for Annual Salary in thousands
p <- ggplot(task2, aes(x = AST)) +
geom_histogram(binwidth = binwidth_fdr)
print(p)
# Histogram for AST in thousands grouped by Undergrad Degree
p_undergrad <- ggplot(task2, aes(x = AST, fill = Undergrad_Degree)) +
geom_histogram(binwidth = binwidth_fdr, position = "dodge") + labs(title = "Histogram of AST (in Thousands) by Undergraduate Degree")
print(p_undergrad)
# Histogram for AST in thousands grouped by Work Experience
p_work_experience <- ggplot(task2, aes(x = AST, fill = Work_Experience)) +
geom_histogram(binwidth = binwidth_fdr, position = "dodge") + labs(title = "Histogram of AST (in Thousands) by Work Experience")
print(p_work_experience)
From the descriptive statistics, we deduce there are no missing data. The median and mean are significantly different, while the trimmed mean and the median aren’t. From the direction of the difference and the skewness coefficient, we conclude the data are left skewed and we have some outliers on the right. The range of the data also suggests that’s the case. Kurtosis coefficient also says the data is elongated, so we expect to have a large number of people earning the average wage, or at least somewhere around 104 000.
The data in this form aren’t great for analysis, so we decide to plot a histogram of salary data in thousands. The first histogram plot suggests what we have already suspected, while the histograms per profession suggest engineers and computer scientists will earn approximately an average wage, artists will approximately earn below average, while business and finance graduates will have their wages in a wide range. It can be seen that the undergraduate counts are non-balanced as well, but we disregard the fact in the previous analysis, since we focus on the general tendencies.
Prior experience is non-balanced, hence making conclusions is a slippery slope. From the plot we can say experience doesn’t play a significant role, however there is one outlier who has prior experience and is a business graduate, while the next highest earner is a finance graduate with no prior experience.
qqPlot(task2$MBA_Grade)
## [1] 92 68
shapiro.test(task2$MBA_Grade)
##
## Shapiro-Wilk normality test
##
## data: task2$MBA_Grade
## W = 0.98799, p-value = 0.5078
# Two-sided t-test for MBA_Grade equal to 74
t_test_two_sided <- t.test(task2$MBA_Grade, mu = 74, alternative = "two.sided")
print(t_test_two_sided)
##
## One Sample t-test
##
## data: task2$MBA_Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Prior to performing the t-test, it would be nice to test the hypothesis of data being normally distributed. From the Q-Q plot and the Shapiro test, we get the data are normally distributed and we can fully trust the t-test. The mean grade is not equal to 74 according to the t-test, since p-value is smaller than 0.05.
# Load the dataset and name it task3
task3 <- read_excel("C:/R/Task 3/Apartments.xlsx")
colnames(task3) <- gsub(" ", "_", colnames(task3)) # Spaces in column names aren't needed