Homework 4

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

county <- read_csv("https://corgis-edu.github.io/corgis/datasets/csv/county_demographics/county_demographics.csv")

## Rows: 3139 Columns: 43

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): County, State
## dbl (41): Age.Percent 65 and Older, Age.Percent Under 18 Years, Age.Percent ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

texas <- county %>%
  filter(State %in% c("TX", "Texas"))

Variable Description

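The variable selected for this analysis is median household income, stored in the column `Income.Median Houseold Income` (the spelling "Houseold" is the column name the code uses, so it is kept as-is). For each Texas county, this value is the midpoint of the household income distribution: half of the county's households earn less and half earn more. Because the median is not pulled upward by a few very high earners, it gives a reasonable picture of the typical household's economic well-being in each county.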

Descriptive Statistics

stat.desc(texas$`Income.Median Houseold Income`)

##      nbr.val     nbr.null       nbr.na          min          max        range 
## 2.540000e+02 0.000000e+00 0.000000e+00 2.509800e+04 1.009200e+05 7.582200e+04 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 1.342329e+07 5.137100e+04 5.284759e+04 8.021239e+02 1.579691e+03 1.634243e+08 
##      std.dev     coef.var 
## 1.278375e+04 2.418985e-01
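The derived entries in this output can be checked by hand from the basic ones. The sketch below is a minimal check in base R, assuming that `CI.mean.0.95` is the half-width of a 95% t-based confidence interval for the mean; the variable names are illustrative and the inputs are the rounded values printed above, so results agree only up to rounding.

```r
# Values copied from the stat.desc() output above (rounded for display).
n       <- 254          # nbr.val
total   <- 1.342329e+07 # sum
min_inc <- 25098        # min
max_inc <- 100920       # max
std_dev <- 12783.75     # std.dev

mean_inc  <- total / n                          # mean = sum / n
se_mean   <- std_dev / sqrt(n)                  # SE.mean = std.dev / sqrt(n)
ci_half   <- qt(0.975, df = n - 1) * se_mean    # assumed: half-width of 95% t interval
coef_var  <- std_dev / mean_inc                 # coef.var = std.dev / mean
inc_range <- max_inc - min_inc                  # range = max - min

# Print the recomputed values side by side for comparison with the output above.
round(c(mean = mean_inc, SE.mean = se_mean, CI.mean.0.95 = ci_half,
        coef.var = coef_var, range = inc_range), 4)
```

Under that assumption, the 95% confidence interval for the mean county-level median income is the mean plus or minus `ci_half`, roughly $51,268 to $54,427.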

Remove Missing Values

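# nbr.na is 0 in the stat.desc() output above, so this filter drops no rows;
# it is kept as a safeguard in case the source data changes.
texas_clean <- texas %>%
  filter(!is.na(`Income.Median Houseold Income`))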

Histogram (Original Variable)

hist(texas_clean$`Income.Median Houseold Income`,
     main = "Median Household Income (Texas Counties)",
     xlab = "Median Household Income (USD)")

Log Transformation

texas_clean <- texas_clean %>%
  mutate(log_income = log(`Income.Median Houseold Income`))

Histogram (Transformed Variable)

hist(texas_clean$log_income,
     main = "Log of Median Household Income (Texas Counties)",
     xlab = "Natural Log of Median Household Income")

Interpretation

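The original income distribution appears right-skewed. The descriptive statistics support this: the mean ($52,848) exceeds the median ($51,371), and the maximum ($100,920) lies much farther above the median than the minimum ($25,098) lies below it. Most counties cluster at lower income levels, while a small number of higher-income counties form a long right tail. After the natural-log transformation, the distribution appears more symmetric. The log compresses large values more than small ones, which pulls in the right tail and makes the variable better suited to methods that assume approximate normality, such as t-tests and linear regression.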
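The visual comparison of the two histograms can be backed by a number. The sketch below computes a moment-based sample skewness in base R and applies it before and after a log transformation. The function name `sample_skewness` and the vector `toy_income` are hypothetical: the toy values are constructed to be symmetric on the log scale and are not the Texas county data.

```r
# Moment-based sample skewness: third central moment divided by the cubed
# (population-style) standard deviation. Values near 0 indicate symmetry;
# positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Hypothetical toy incomes, evenly spaced on the log scale around 50,000.
toy_income <- 50000 * exp(c(-0.6, -0.3, 0, 0.3, 0.6))

skew_raw <- sample_skewness(toy_income)       # positive: right-skewed
skew_log <- sample_skewness(log(toy_income))  # approximately 0: symmetric

c(raw = skew_raw, log = skew_log)
```

To apply the same check to the analysis here, one could call `sample_skewness(texas_clean$`Income.Median Houseold Income`)` and `sample_skewness(texas_clean$log_income)`. A log-scale skewness closer to zero than the raw-scale skewness would be consistent with the interpretation above.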