Lab 1: Describing Data

Question 1: Loading and Exploring the Dataset

head(state.x77)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

states <- as.data.frame(state.x77)
attach(states)

Question 2: Measures of Central Tendency

mean(Population)

## [1] 4246.42

median(Population)

## [1] 2838.5

The sample mean is a measure of central tendency that gives the average value of the set: the sum of all observed values divided by the number of observations. Because Population is recorded in thousands, the sample mean of 4246.42 tells us that the average estimated population of the 50 states in 1975 was about 4,246,420 people. The sample median is another measure of central tendency that gives the value of the middle-ranked observation once the observations are arranged in numerical order. Thus, the sample median tells us that half of the 50 states had populations above 2,838,500 people while the other half had populations below that value. The mean is well above the median, which suggests that a few very populous states pull the average upward.
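As a rough check on the definitions above, the two statistics can be rebuilt by hand and compared with the built-in functions. This is only a sketch; the names `pop`, `manual_mean`, and `manual_median` are illustrative and not part of the lab.

```r
# Rebuild the mean and median of Population from their definitions.
states <- as.data.frame(state.x77)
pop <- states$Population                 # recorded in thousands of people

# Mean: sum of the observations divided by the number of observations
manual_mean <- sum(pop) / length(pop)

# Median: with an even n (here 50), average the two middle ordered values
sorted_pop <- sort(pop)
n <- length(sorted_pop)
manual_median <- (sorted_pop[n / 2] + sorted_pop[n / 2 + 1]) / 2

# Both comparisons should report TRUE
all.equal(manual_mean, mean(pop))
all.equal(manual_median, median(pop))
```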

Question 3: Measures of Variation

var(Population)

## [1] 19931684

sd(Population)

## [1] 4464.491

The sample variance is a measure of spread: the sum of the squared differences between each observation and the sample mean, divided by n - 1. It indicates how much the data vary around the mean. Because Population is recorded in thousands, the sample variance of 19,931,684 is in squared thousands of people (about 1.99 x 10^13 in squared people). This high value indicates that the populations of the 50 states are widely spread out around the sample mean of 4,246,420 people. The variance is this large because it is based on squared deviations from the mean, so it is not in the same units as the data. The standard deviation, the square root of the sample variance, is therefore often easier to interpret. The standard deviation of this sample tells us that state populations typically differ from the mean by about 4,464,491 people, which confirms that the population sizes of the states are quite spread out.
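The relationship between the variance and the standard deviation can be sketched directly from the formulas. The variable names below are illustrative, and the unit conversion assumes Population is stored in thousands of people, as the interpretation above does.

```r
# Rebuild var() and sd() from the definition of the sample variance.
states <- as.data.frame(state.x77)
pop <- states$Population                 # recorded in thousands of people
n <- length(pop)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((pop - mean(pop))^2) / (n - 1)

# Standard deviation: square root of the sample variance
manual_sd <- sqrt(manual_var)

# Convert from thousands to individual people
sd_people <- manual_sd * 1000            # same units as the data
var_people_sq <- manual_var * 1000^2     # squared units

# Both comparisons should report TRUE
all.equal(manual_var, var(pop))
all.equal(manual_sd, sd(pop))
```

Note that the standard deviation scales by 1,000 while the variance scales by 1,000 squared, which is why the variance looks so much larger.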

Question 4: Extremes

max(Population)

## [1] 21198

min(Population)

## [1] 365

Question 5: Visualizing the Data Distribution

hist(Population)

The histogram shows a large concentration of observations (state populations) between 0 and 5,000 on the horizontal axis, that is, between 0 and 5,000,000 people, with frequencies decreasing as fewer and fewer states have larger populations. Thus, the majority of states have populations between 0 and 5,000,000 people, and the distribution is skewed to the right.
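To read the bar heights as numbers rather than from the plot, one option is to ask `hist()` for its bins without drawing anything. This is a sketch; the object names `h` and `bins` are illustrative.

```r
# Extract the bin edges and counts that hist() would draw.
states <- as.data.frame(state.x77)
h <- hist(states$Population, plot = FALSE)

# One row per bar: lower edge, upper edge, number of states in the bin
bins <- data.frame(
  lower = head(h$breaks, -1),
  upper = h$breaks[-1],
  count = h$counts
)
bins
```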

Question 6: Summing Variables

sum(Population)

## [1] 212321

Question 7: Creating a New Variable

big <- Population > 5000
table(big)

## big
## FALSE  TRUE 
##    38    12

The table shows that 12 states have populations larger than 5,000,000 people, while the other 38 have populations at or below that value.
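Because `big` is a logical vector, the same counts can be obtained arithmetically, and the flagged states can be listed by name. This is a sketch that rebuilds `big` from the data frame instead of relying on `attach()`; `big_states` is an illustrative name.

```r
# Work with the logical indicator directly.
states <- as.data.frame(state.x77)
big <- states$Population > 5000          # TRUE for populations over 5,000,000

sum(big)     # TRUE counts as 1, so this is the number of big states: 12
mean(big)    # proportion of big states: 12 / 50 = 0.24

# Names of the states flagged as big
big_states <- rownames(states)[big]
big_states
```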

Question 8: Plotting Variables

plot(Area, Population)

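The scatterplot does not show a clear linear relationship between population size and state area (in square miles). Most states cluster in the lower-left corner, with populations between 0 and 5,000,000 people and areas between 0 and 100,000 square miles. A few states stand apart from this cluster, such as Alaska (by far the largest area but one of the smallest populations) and California (the largest population), so any apparent trend is driven mainly by these extreme points.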
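A visual impression of correlation can be checked numerically with `cor()`. The sketch below also recomputes the coefficient without Alaska, as one illustrative way to see how much a single extreme point matters; the names `r_all` and `r_no_alaska` are not part of the lab.

```r
# Pearson correlation between area and population.
states <- as.data.frame(state.x77)
r_all <- cor(states$Area, states$Population)
r_all

# Same coefficient with Alaska (an extreme value for Area) left out
no_alaska <- states[rownames(states) != "Alaska", ]
r_no_alaska <- cor(no_alaska$Area, no_alaska$Population)
r_no_alaska
```

Values near 0 indicate little linear association, while values near -1 or 1 indicate a strong one.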

Question 9: How Many People per Square Mile?

pop.density <- Population / Area
hist(pop.density)

Most of the data on the histogram falls between population density values of 0 and 0.2, with frequencies decreasing as fewer states have densities above 0.2. Because Population is recorded in thousands, these values are in thousands of people per square mile, so 0.2 corresponds to 200 people per square mile. This observation tells us that most states have low population densities, meaning their populations are spread out, with relatively few people per square mile of land. In essence, their populations are small relative to the geographical area available.
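To express the density in people per square mile, as the question title asks, the population can be rescaled before dividing. This is a sketch that assumes Population is in thousands and Area is in square miles; `density_per_sq_mile` is an illustrative name.

```r
# Population density in people (not thousands) per square mile.
states <- as.data.frame(state.x77)
density_per_sq_mile <- states$Population * 1000 / states$Area
names(density_per_sq_mile) <- rownames(states)

summary(density_per_sq_mile)

# Share of states below 200 people per square mile
mean(density_per_sq_mile < 200)

# Least and most densely populated states
names(which.min(density_per_sq_mile))
names(which.max(density_per_sq_mile))
```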