Task 1

We focus on USArrests data. Official documentation states that the dataset contains 50 rows for every US state and the data from the year 1973 for:

This doesn’t quite satisfy the condition of the task. We need a categorical variable as well. It is well known that states dataset contains demographic data for US states in R. We need to make sure that the data we add was relevant in 1973. For example, it makes no sense to compare the income data from 1985 with the crime data from the year of 1973. The data are normalized per 100.000 inhabitants, which makes them comparable.

library(datasets)
library(car)
## Loading required package: carData
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
library(readxl)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
dataset <- datasets::USArrests #this is the dataset we're going to work with
rownames(dataset) #we see that the state is actually a row name, not a column. Let's add it.
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
dataset$State<-rownames(dataset) #...and let's remove the rownames, function cbind(...)
rownames(dataset) = NULL #let us add the region the state is in. This information we can derive from the state.region dataset
temp <-data.frame(State = datasets::state.name, Region = datasets::state.region) #Now we can merge the data frames because we have the State column
merged <- merge(dataset, temp, by = "State")
head(merged) # prints first 6 rows
##        State Murder Assault UrbanPop Rape Region
## 1    Alabama   13.2     236       58 21.2  South
## 2     Alaska   10.0     263       48 44.5   West
## 3    Arizona    8.1     294       80 31.0   West
## 4   Arkansas    8.8     190       50 19.5  South
## 5 California    9.0     276       91 40.6   West
## 6   Colorado    7.9     204       78 38.7   West
#Note that region is a factor with 4 levels, not a character column.
# Rename the column UrbanPop to Urban
merged$Urban <- merged$UrbanPop
merged$UrbanPop <- NULL

# Create new columns
merged$TotalArr <- merged$Murder + merged$Assault + merged$Rape
merged$TAUR <- merged$TotalArr / merged$Urban
merged$M2 <- merged$Murder * merged$Murder

# Display the first few rows of the dataset
print(head(merged))
##        State Murder Assault Rape Region Urban TotalArr     TAUR     M2
## 1    Alabama   13.2     236 21.2  South    58    270.4 4.662069 174.24
## 2     Alaska   10.0     263 44.5   West    48    317.5 6.614583 100.00
## 3    Arizona    8.1     294 31.0   West    80    333.1 4.163750  65.61
## 4   Arkansas    8.8     190 19.5  South    50    218.3 4.366000  77.44
## 5 California    9.0     276 40.6   West    91    325.6 3.578022  81.00
## 6   Colorado    7.9     204 38.7   West    78    250.6 3.212821  62.41
# Create Safest dataset
Safest <- merged[order(merged$TotalArr), ]
Safest <- Safest[1:3, ]
print(head(Safest))
##           State Murder Assault Rape        Region Urban TotalArr     TAUR   M2
## 34 North Dakota    0.8      45  7.3 North Central    44     53.1 1.206818 0.64
## 45      Vermont    2.2      48 11.2     Northeast    32     61.4 1.918750 4.84
## 49    Wisconsin    2.6      53 10.8 North Central    66     66.4 1.006061 6.76
# Create Dangerous dataset
Dangerous <- merged[order(-merged$TotalArr), ]
Dangerous <- Dangerous[1:3, ]
print(head(Dangerous))
##             State Murder Assault Rape Region Urban TotalArr     TAUR     M2
## 9         Florida   15.4     335 31.9  South    80    382.3 4.778750 237.16
## 33 North Carolina   13.0     337 16.1  South    45    366.1 8.135556 169.00
## 20       Maryland   11.3     300 27.8  South    67    339.1 5.061194 127.69
# Remove M2 column
merged$M2 <- NULL
print(head(merged))
##        State Murder Assault Rape Region Urban TotalArr     TAUR
## 1    Alabama   13.2     236 21.2  South    58    270.4 4.662069
## 2     Alaska   10.0     263 44.5   West    48    317.5 6.614583
## 3    Arizona    8.1     294 31.0   West    80    333.1 4.163750
## 4   Arkansas    8.8     190 19.5  South    50    218.3 4.366000
## 5 California    9.0     276 40.6   West    91    325.6 3.578022
## 6   Colorado    7.9     204 38.7   West    78    250.6 3.212821

Regarding Summary statistics, we can use a base function summary for the basic summary statistics. However, if we need a more detailed approach, the package psych has a neat function describe which encompases the (usual) descriptive statistics in the one function.

num <- merged[, names(merged) != "Region"] # Remove categorical variables, pt 1 
num <- num[, names(num) != "State"] # Remove categorical variables, pt 2
descStat <-describe(num)
print(descStat)
##          vars  n   mean    sd median trimmed    mad   min    max  range  skew
## Murder      1 50   7.79  4.36   7.25    7.53   5.41  0.80  17.40  16.60  0.37
## Assault     2 50 170.76 83.34 159.00  168.48 110.45 45.00 337.00 292.00  0.22
## Rape        3 50  21.23  9.37  20.10   20.36   8.60  7.30  46.00  38.70  0.75
## Urban       4 50  65.54 14.47  66.00   65.88  17.79 32.00  91.00  59.00 -0.21
## TotalArr    5 50 199.78 93.37 185.20  198.02 116.46 53.10 382.30 329.20  0.17
## TAUR        6 50   3.12  1.59   2.75    2.93   1.32  0.86   8.14   7.27  1.05
##          kurtosis    se
## Murder      -0.95  0.62
## Assault     -1.15 11.79
## Rape         0.08  1.32
## Urban       -0.87  2.05
## TotalArr    -1.24 13.20
## TAUR         0.86  0.23

There’s no missing data. Assaults contribute the most to the total arrest numbers, but there seem to be no outliers, since means and medians are very close, and since trimmed mean (at 0.1 level) and mean are close. The TAUR data are the most skewed to the left, while from the kurtosis coefficient, we conclude that the rapes are the closest to the normal bell shape and the data aren’t “flattened” or “elongated”.

merged1 <- merged
merged1$Abbr <- state.abb # Adding the State Abbreviations for easier readability of plots
head(merged1)
##        State Murder Assault Rape Region Urban TotalArr     TAUR Abbr
## 1    Alabama   13.2     236 21.2  South    58    270.4 4.662069   AL
## 2     Alaska   10.0     263 44.5   West    48    317.5 6.614583   AK
## 3    Arizona    8.1     294 31.0   West    80    333.1 4.163750   AZ
## 4   Arkansas    8.8     190 19.5  South    50    218.3 4.366000   AR
## 5 California    9.0     276 40.6   West    91    325.6 3.578022   CA
## 6   Colorado    7.9     204 38.7   West    78    250.6 3.212821   CO
# List of numeric variables to plot: "Murder"   "Assault"  "Rape"     "Urban"    "TotalArr"  "TAUR" 

# Create individual scatterplots, we need to rotate the labels on the x-axis by 90 degrees.

p1 <- ggplot(merged1, aes(x = Abbr, y = Murder, color = Region)) +
  geom_point() +
  labs(x = "", y = "Murder", title = "Plot of Murder by State") +
  theme(axis.text.x = element_text(angle = 90))

print(p1)

p2 <- ggplot(merged1, aes(x = Abbr, y = Assault, color = Region)) +
  geom_point() +
  labs(x = "", y = "Assault", title = "Plot of Assault by State") +
  theme(axis.text.x = element_text(angle = 90))

print(p2)

p3 <- ggplot(merged1, aes(x = Abbr, y = Rape, color = Region)) +
  geom_point() +
  labs(x = "", y = "Rape", title = "Plot of Rape by State") +
  theme(axis.text.x = element_text(angle = 90))

print(p3)

p4 <- ggplot(merged1, aes(x = Abbr, y = Urban, color = Region)) +
  geom_point() +
  labs(x = "", y = "Urban", title = "Plot of Urban by State") +
  theme(axis.text.x = element_text(angle = 90))

print(p4)

p5 <- ggplot(merged1, aes(x = Abbr, y = TotalArr, color = Region)) +
  geom_point() +
  labs(x = "", y = "Total Arrests", title = "Plot of Total Arrests by State") +
  theme(axis.text.x = element_text(angle = 90))

print(p5)

p6 <- ggplot(merged1, aes(x = Abbr, y = TAUR, color = Region)) +
  geom_point() +
  labs(x = "", y = "TAUR", title = "Plot of TAUR by State") +
  theme(axis.text.x = element_text(angle = 90))

print(p6)

# We generate boxplots per Region

p1 <- ggplot(merged1, aes(x = Region, y = Murder, fill = Region)) +
  geom_boxplot() +
  labs(x = "", y = "Murder", title = "Boxplot of Murder by Region") 

print(p1)

p2 <- ggplot(merged1, aes(x = Region, y = Assault, fill = Region)) +
  geom_boxplot() +
  labs(x = "", y = "Assault", title = "Boxplot of Assault by Region")

print(p2)

p3 <- ggplot(merged1, aes(x = Region, y = Rape, fill = Region)) +
  geom_boxplot() +
  labs(x = "", y = "Rape", title = "Boxplot of Rape by Region") 

print(p3)

p4 <- ggplot(merged1, aes(x = Region, y = Urban, fill = Region)) +
  geom_boxplot() +
  labs(x = "", y = "Urban", title = "Boxplot of Urban by Region") 

print(p4)

p5 <- ggplot(merged1, aes(x = Region, y = TotalArr, fill = Region)) +
  geom_boxplot() +
  labs(x = "", y = "Total Arrests", title = "Boxplot of Total Arrests by Region")

print(p5)

p6 <- ggplot(merged1, aes(x = Region, y = TAUR, fill = Region)) +
  geom_boxplot() +
  labs(x = "", y = "TAUR", title = "Boxplot of TAUR by Region") 

print(p6)

# Function to calculate binwidth using Diaconis-Freedman rule
FDR <- function(x) {
  2 * IQR(x) / (length(x)^(1/3))
}


# Create individual histograms for each numeric variable
p1 <- ggplot(merged1, aes(x = Murder)) +
  geom_histogram(binwidth = FDR(merged1[,2])) +
  labs(x = "Murder", y = "Frequency", title = "Histogram of Murder") 

print(p1)

p2 <- ggplot(merged1, aes(x = Assault)) +
  geom_histogram(binwidth = FDR(merged1[,3])) +
  labs(x = "Assault", y = "Frequency", title = "Histogram of Assault")

print(p2)

p3 <- ggplot(merged1, aes(x = Rape)) +
  geom_histogram(binwidth = FDR(merged1[,4])) +
  labs(x = "Rape", y = "Frequency", title = "Histogram of Rape") 

print(p3)

p4 <- ggplot(merged1, aes(x = Urban)) +
  geom_histogram(binwidth = FDR(merged1[,6])) +
  labs(x = "Urban", y = "Frequency", title = "Histogram of Urban") 

print(p4)

p5 <- ggplot(merged1, aes(x = TotalArr)) +
  geom_histogram(binwidth = FDR(merged1[,7])) +
  labs(x = "TotalArr", y = "Frequency", title = "Histogram of Total Arrests")

print(p5)

p6 <- ggplot(merged1, aes(x = TAUR)) +
  geom_histogram(binwidth = FDR(merged1[,8])) +
  labs(x = "TAUR", y = "Frequency", title = "Histogram of TAUR") 

print(p6)

# Create a data frame with the count of states per region
reg<-table(merged1$Region)
region_counts <- as.data.frame(reg)
colnames(region_counts) <- c("Region", "Count")

# Plot the distribution of the number of states per region
p <- ggplot(region_counts, aes(x = Region, y = Count, fill = Region)) +
  geom_bar(stat = "identity") + 
  labs(x = "Region", y = "Number of States", title = "Distribution of Number of States per Region") 

print(p)

In the first part, we have created scatterplots, from which we can conclude that southern states dominate the number of assaults. Some western states have a large number of assaults as well. Southern states have the largest number of murders as well. On the other hand, western states dominate the rape scatterplot. Regarding urban development, one can see that the northeastern states are the most developed ones and have the smallest number of arrests. The connection between the two may be speculated to originate from the fact the police are overwhelmed in large cities and it’s generally easier to make an arrest in the rural area, conditional on finding the suspect’s whereabouts. Total arrests scatterplot shows us southern and western states dominate the plot. Regarding the TAUR statistic we invented, we can conclude southern states have the largest TAUR, while the north-central states have the lowest.

From the boxplots, we conclude there are no outliers in the number of murders. The number of assaults varies greatly in the west. North central has a low number of assaults in general, except the states of Illinois and Michigan, which have the number of assaults in the 250s range. The number of rapes varies the most in the west, being largest in Nevada, possibly because of Las Vegas, while being lowest in Idaho, which is the most rural western state. The urban status varies greatly in the northeast, being one of the highest for New Jersey, and being the lowest in the States for Vermont. Regarding the north central states, they are usually urban, except for the Dakotas. Regarding total arrests, the west has the largest range of data, being the highest for Arizona, which dominates the number of assaults, and the smallest for Hawaii, which is an island state full of tourists who don’t generally commit violent crimes. Regarding north central region, Michigan bears the highest arrest rate and the highest assault rates. TAUR has a few outliers. In the northeast, New York is the outlier, since it has the largest assault numbers, and hence the largest total arrests, while Urban is in the 80s range. New Hampshire is another outlier, bearing a very low number of total arrests and the urban development index in the 50s range. In the south, North Carolina has the largest TAUR, since it has the low urban development index and a large number of assaults. In the west, Alaska is the outlier, since it has the lowest urban development and a large assault number. This index can be useful for the federal government to decide on police force distribution in rural states with high assault rates.

The histograms tell us a bit more about the shape of the density, since the histogram serves as a density estimator. We need to specify the number of bins for the data, since the default is 30, which does not suit our needs. We use Freedman - Diaconis rule. We conclude right tail of murder data is heavy and there’s not a large number of states with low murder rates. The mode is somewhere in 5-10 range, but we conclude there’s a large number of the states having the large number of murders (10-15). Density of the assaults is monotonously decreasing, with a heavy left tail, which is expected. Histogram of rape is more or less bell-shaped, however right tail is long because Alaska, California and Nevada have rape arrests larger than 40. Urban development histogram has a heavier right tail, indicating the country is more urban than rural in general, which can be attributed to the economic development. Histogram of total arrests is flat, which is expected on a national level. Histogram of TAUR is bell-shaped with Alaska, Mississippi and Carolinas contributing to the longer right tail, since they have TAUR larger than 6. Regarding the state numbers, we see the number of southern states is the largest.The data per region seem to be well balanced.

Task 2

We need to load the dataset.

# Load the dataset and name it task2
task2 <-read_excel("C:/R/Task 2/Business School.xlsx")
colnames(task2) <- gsub(" ", "_", colnames(task2)) # Spaces in column names aren't needed

Now do the same thing we did in the task above, but for the average salary.

# Create a data frame with the count of entries per Undergrad_Degree
und <- table(task2$Undergrad_Degree)
undergrad_counts <- as.data.frame(und)
colnames(undergrad_counts) <- c("Undergrad_Degree", "Count")

# Plot the distribution of the number of entries per Undergrad_Degree
p <- ggplot(undergrad_counts, aes(x = Undergrad_Degree, y = Count)) +
  geom_bar(stat = "identity") + 
  labs(x = "Undergrad Degree", y = "Number of Entries", title = "Distribution of Number of Entries per Undergrad Degree") 

print(p)

The largest number of graduates is in Business and that’s a mode of the data.

# Calculate descriptive statistics for Annual Salary
desc <- describe(task2$Annual_Salary)
rownames(desc) = "Annual Salary"
print(desc)
##               vars   n   mean       sd median  trimmed     mad   min    max
## Annual Salary    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000
##                range skew kurtosis      se
## Annual Salary 320000 2.22     9.41 4150.15
# Transform Annual Salary to thousands
task2$AST <- task2$Annual_Salary / 1000
binwidth_fdr <- FDR(task2$AST)

# Histogram for Annual Salary in thousands
p <- ggplot(task2, aes(x = AST)) +
  geom_histogram(binwidth = binwidth_fdr)

print(p)

# Histogram for AST in thousands grouped by Undergrad Degree
p_undergrad <- ggplot(task2, aes(x = AST, fill = Undergrad_Degree)) +
  geom_histogram(binwidth = binwidth_fdr, position = "dodge") + labs(title = "Histogram of AST (in Thousands) by Undergraduate Degree")

print(p_undergrad)

# Histogram for AST in thousands grouped by Work Experience
p_work_experience <- ggplot(task2, aes(x = AST, fill = Work_Experience)) +
  geom_histogram(binwidth = binwidth_fdr, position = "dodge") + labs(title = "Histogram of AST (in Thousands) by Work Experience")

print(p_work_experience)

From the descriptive statistics, we deduce there are no missing data. The median and mean are significantly different, while the trimmed mean and the median aren’t. From the direction of the difference and the skewness coefficient, we conclude the data are left skewed and we have some outliers on the right. The range of the data also suggests that’s the case. Kurtosis coefficient also says the data is elongated, so we expect to have a large number of people earning the average wage, or at least somewhere around 104 000.

The data in this form aren’t great for analysis, so we decide to plot a histogram of salary data in thousands. The first histogram plot suggests what we have already suspected, while the histograms per profession suggest engineers and computer scientists will earn approximately an average wage, artists will approximately earn below average, while business and finance graduates will have their wages in a wide range. It can be seen that the undergraduate counts are non-balanced as well, but we disregard the fact in the previous analysis, since we focus on the general tendencies.

Prior experience is non-balanced, hence making conclusions is a slippery slope. From the plot we can say experience doesn’t play a significant role, however there is one outlier who has prior experience and is a business graduate, while the next highest earner is a finance graduate with no prior experience.

qqPlot(task2$MBA_Grade)

## [1] 92 68
shapiro.test(task2$MBA_Grade)
## 
##  Shapiro-Wilk normality test
## 
## data:  task2$MBA_Grade
## W = 0.98799, p-value = 0.5078
# Two-sided t-test for MBA_Grade equal to 74
t_test_two_sided <- t.test(task2$MBA_Grade, mu = 74, alternative = "two.sided")
print(t_test_two_sided)
## 
##  One Sample t-test
## 
## data:  task2$MBA_Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Prior to performing the t-test, it would be nice to test the hypothesis of data being normally distributed. From the Q-Q plot and the Shapiro test, we get the data are normally distributed and we can fully trust the t-test. The mean grade is not equal to 74 according to the t-test, since p-value is smaller than 0.05.

Task 3

Import the dataset Apartments.xlsx

# Load the dataset and name it task3
task3 <- read_excel("C:/R/Task 3/Apartments.xlsx")
colnames(task3) <- gsub(" ", "_", colnames(task3)) # Spaces in column names aren't needed