FIRST-TERM MIDTERM EXAM - Statistics and Programming

## Libraries used

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(sf)

## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE

library(tigris)

## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.

library(tidycensus)
library(mapview)
library(viridis)

## Loading required package: viridisLite

library(knitr)
library(leaflet)
library(stringr)
library(ggplot2)
library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

options(tigris_use_cache = TRUE)

# Read the key from the CENSUS_API_KEY environment variable instead of hard-coding it in the notebook
census_api_key(Sys.getenv("CENSUS_API_KEY"))

## To install your API key for use in future sessions, run this function with `install = TRUE`.

-Presentation of the exercise. The purpose of this logical-mathematical exercise is to analyze how population is distributed across the United States and to classify the states by that variable. The measure used is a head count (total population in occupied housing units), not a density per unit area. The table below lists the Census variables that relate population to the tenure of occupied housing units in each state. *Note: the "name" column holds the variable codes; the code H011001 is the one referred to below as "population".

v10 <- load_variables(2010, "sf1", cache = TRUE)
v10 <- v10 %>% 
       filter(grepl("population", tolower(label), fixed = TRUE))
kable(head(v10))

name	label	concept
H011001	Total population in occupied housing units	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE
H011002	Total population in occupied housing units!!Owned with a mortgage or a loan	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE
H011003	Total population in occupied housing units!!Owned free and clear	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE
H011004	Total population in occupied housing units!!Renter occupied	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE
H011A001	Population in occupied housing units	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE (WHITE ALONE HOUSEHOLDER)
H011A002	Population in occupied housing units!!Owned with a mortgage or a loan	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE (WHITE ALONE HOUSEHOLDER)

The following is a sample of the population values for a few states, listed in ascending alphabetical order:

population <- get_decennial(geography = "state", variables = c(population = "H011001"), 
                            shift_geo = TRUE, geometry = TRUE)

## Warning: The `shift_geo` argument is deprecated and will be removed in a future
## release. We recommend using `tigris::shift_geometry()` instead.

## Getting data from the 2010 decennial Census

## Using feature geometry obtained from the albersusa package

## Using Census Summary File 1

## Please note: Alaska and Hawaii are being shifted and are not to scale.

## old-style crs object detected; please recreate object with a recent sf::st_crs()

kable(head(population))

GEOID	NAME	variable	value	geometry
04	Arizona	population	6252633	MULTIPOLYGON (((-1111066 -8…
05	Arkansas	population	2836987	MULTIPOLYGON (((557903.1 -1…
06	California	population	36434140	MULTIPOLYGON (((-1853480 -9…
08	Colorado	population	4913318	MULTIPOLYGON (((-613452.9 -…
09	Connecticut	population	3455945	MULTIPOLYGON (((2226838 519…
11	District of Columbia	population	561702	MULTIPOLYGON (((1960720 -41…

NOTE: Among the six states shown in the table above, California has by far the largest population, followed by Arizona, Colorado, Connecticut, Arkansas, and the District of Columbia. Each row also carries the state's geometry, the spatial data from which the map further below is drawn. The relationship between these population values can be seen in the following plot:

# Data: GEOID codes (x) and populations (y) of the six states in the table
x <- c(04, 05, 06, 08, 09, 11)
y <- c(6252633, 2836987, 36434140, 4913318, 3455945, 561702)

# Line plot of population against GEOID
plot(x, y, type = "l")
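
GEOID is only an identifier, so a line drawn against it does not by itself rank the states. As a minimal sketch, using only the six values printed in the table above (the names `pop` and `ranked` are illustrative and do not come from the original notebook), the ordering described in the note can be reproduced in base R:

```r
# Six values copied from the table above (population in occupied housing units)
pop <- c(Arizona = 6252633, Arkansas = 2836987, California = 36434140,
         Colorado = 4913318, Connecticut = 3455945,
         `District of Columbia` = 561702)

# Sort from largest to smallest to obtain the ranking
ranked <- sort(pop, decreasing = TRUE)
print(names(ranked))

# A bar chart compares categories more directly than a line over GEOID
barplot(ranked, las = 2, cex.names = 0.7, ylab = "Population")
```

Sorting first and plotting bars is one possible design choice; it keeps the state names on the axis instead of numeric codes.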

For a clearer picture, the following map of the United States gives a complete overview of the population of every state:

# Continuous color scale over the observed population values
pal <- colorNumeric(palette = "viridis", 
                    domain = population$value)
population %>%
  # EPSG:4326 (WGS 84) given as a numeric code; the "+init=epsg:4326" string is deprecated
  st_transform(crs = 4326) %>%
  leaflet(width = "100%") %>%
  addProviderTiles(provider = "CartoDB.Positron") %>%
  addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
              stroke = FALSE,
              smoothFactor = 0,
              fillOpacity = 0.7,
              color = ~ pal(value)) %>%
  addLegend("bottomright", 
            pal = pal, 
            values = ~ value,
            title = "Population",
            opacity = 1)

ANALYSIS OF THE EXERCISE:

Measures of location and dispersion

-Measures of central tendency

We first find the central point of the general values that the map yields:

x <- c(5000000, 10000000, 15000000, 20000000, 25000000, 30000000, 35000000)
mean(x)
median(x)
table(x)

## [1] 2e+07

## [1] 2e+07

## x
##   5e+06   1e+07 1.5e+07   2e+07 2.5e+07   3e+07 3.5e+07 
##       1       1       1       1       1       1       1

Mean:

with(population, mean(value, na.rm = TRUE))

## [1] 5897220

*This is the arithmetic mean of the state populations, that is, the expected value of the data

Median:

with(population, median(value, na.rm = TRUE))

## [1] 4213497

*This is the middle value of the ordered data shown on the population map

Mode:

with(population, as.numeric(names(table(value))[table(value)==max(table(value))]))

##  [1]   549914   561702   600412   647535   683879   780130   873521   960566
##  [9]  1009904  1276366  1292816  1317421  1538631  1775176  1803612  2016550
## [17]  2664397  2717733  2774044  2836987  2875333  2948243  3455945  3639334
## [25]  3744432  4213497  4405945  4486210  4663920  4913318  5168530  5536772
## [33]  5635177  5814785  6192633  6252633  6296879  6308747  6585165  7761190
## [41]  8605018  9278237  9434454  9654572 11230238 12276266 12528859 18379601
## [49] 18792424 24564422 36434140

*Every value occurs exactly once, so the code returns all 51 values: these data have no meaningful mode
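
When no value repeats, reporting "no mode" is easier to read than listing every observation. The following is a minimal sketch under that assumption; `stat_mode` is a hypothetical helper, not part of the original notebook or of any package:

```r
# Hypothetical helper: returns the most frequent value(s),
# or NA when no value occurs more than once
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  if (length(counts) == 0 || max(counts) == 1) return(NA_real_)
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(c(6252633, 2836987, 36434140))  # all distinct: no mode
stat_mode(c(1, 2, 2, 3, 3))               # two values tie as modes
```

With the state populations, `with(population, stat_mode(value))` would then be expected to give `NA`, since every state has a distinct count.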

Measures of variability or dispersion

Next we analyze how far the data lie from the arithmetic mean

Standard deviation:

Population:

with(population, sqrt(var(value, na.rm = TRUE)*(length(value)-1)/length(value)))

## [1] 6598280

*The figure is large (it exceeds the mean of 5897220), which indicates that the data are widely dispersed around the mean

Sample:

with(population, sqrt(var(value, na.rm = TRUE)))

## [1] 6663936

*Square root of the unbiased estimator of the population variance (divisor n - 1)
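
The two standard deviations above differ only in the divisor (n versus n - 1). A minimal sketch of that relationship in base R, assuming only the six values printed in the earlier table (the variable names are illustrative):

```r
# Six state values copied from the table earlier in the document
value <- c(6252633, 2836987, 36434140, 4913318, 3455945, 561702)
n <- length(value)

sd_sample <- sd(value)                                 # divisor n - 1
sd_population <- sqrt(mean((value - mean(value))^2))   # divisor n

# The two differ exactly by the factor sqrt((n - 1) / n)
all.equal(sd_population, sd_sample * sqrt((n - 1) / n))
```

The population version is always the smaller of the two, and the gap shrinks as n grows.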

Mean absolute deviation:

with(population, mean(abs(value-mean(value, na.rm = TRUE)), na.rm = TRUE))

## [1] 4326382

*The mean absolute deviation (MAD) takes every population value into account, not only the largest and the smallest, and measures the average absolute distance from the mean
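
The same calculation can be wrapped in a small function so that missing values are dropped once, before both the mean and the deviations are computed. This is a sketch only; `mean_abs_dev` is a hypothetical name, and the data are the six values from the earlier table:

```r
# Hypothetical helper: mean absolute deviation around the arithmetic mean
mean_abs_dev <- function(x) {
  x <- x[!is.na(x)]          # drop missing values once, up front
  mean(abs(x - mean(x)))
}

value <- c(6252633, 2836987, 36434140, 4913318, 3455945, 561702)
mean_abs_dev(value)

# Note: stats::mad() computes a different statistic
# (a scaled median absolute deviation around the median)
```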

Variance:

Population:

with(population, mean((value - mean(value, na.rm = TRUE))^2, na.rm = TRUE))

## [1] 4.35373e+13

*This is the mean squared deviation from the mean; its square root is the population standard deviation computed above (6598280)

Range:

# Range as a single number: largest value minus smallest value
Range = function(x){
    maximum = max(x, na.rm = TRUE)
    minimum = min(x, na.rm = TRUE)
    return(maximum - minimum)
}
with(population, Range(value))

## [1] 35884226

*Difference between the largest and the smallest population

with(population, range(value, na.rm = TRUE))

## [1]   549914 36434140

*These are the two values used in the subtraction: the smallest and the largest

with(population, diff(range(value, na.rm = TRUE)))

## [1] 35884226

*Check: subtracting them yields the range again

Interquartile range:

with(population, IQR(value, na.rm = TRUE))

## [1] 4790052

*The data are divided into quartiles, and the interquartile range is the difference between the third and the first quartile (Q3 - Q1)
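
To make the quartile arithmetic explicit, the sketch below computes Q1 and Q3 with `quantile()` and subtracts them, then compares the result with `IQR()`. It uses only the six values from the earlier table, so the number differs from the 51-state result above; the variable names are illustrative:

```r
# Six state values copied from the table earlier in the document
value <- c(6252633, 2836987, 36434140, 4913318, 3455945, 561702)

# First and third quartiles (quantile() and IQR() share the same default method)
q <- quantile(value, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

iqr_manual
all.equal(iqr_manual, IQR(value))
```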

Normal distribution:

ingresos.medianos <- as.data.frame(rnorm(n = length(population$value), mean = mean(population$value, na.rm = TRUE), sd = sd(population$value, na.rm = TRUE)))

p <- ggplot(ingresos.medianos, aes(x=`rnorm(n = length(population$value), mean = mean(population$value, na.rm = TRUE), sd = sd(population$value, na.rm = TRUE))`)) + geom_density()
p

*This is a probability distribution for the population values: a random sample drawn from a normal distribution with the same mean and standard deviation as the data. Because the sample is random, each run gives a slightly different curve.
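
A reproducible variant of this simulation is sketched below. It is not the notebook's own code: the seed is arbitrary, the column name `value` and the object `simulated` are assumptions chosen for readability, and base-R graphics stand in for ggplot2 so that the block runs without extra packages.

```r
set.seed(123)  # arbitrary seed, added only so that reruns give the same sample

# Six state values copied from the table earlier in the document
value <- c(6252633, 2836987, 36434140, 4913318, 3455945, 561702)

# Normal sample with the same size, mean and standard deviation as the data,
# stored under a short, explicit column name
simulated <- data.frame(
  value = rnorm(n = length(value), mean = mean(value), sd = sd(value))
)

# Overlay the observed density (solid) and the simulated one (dashed)
plot(density(value), main = "Observed vs. simulated normal")
lines(density(simulated$value), lty = 2)
```

Naming the column avoids having to quote the whole `rnorm(...)` call in backticks every time the simulated values are used.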

p + geom_vline(aes(xintercept=mean(`rnorm(n = length(population$value), mean = mean(population$value, na.rm = TRUE), sd = sd(population$value, na.rm = TRUE))`)), color="blue", linetype="dashed", size=1)

p <- ggplot(population, aes(x=value)) + 
  geom_density()
p

Skewness of the population data

library(e1071)
with(ingresos.medianos, skewness(`rnorm(n = length(population$value), mean = mean(population$value, na.rm = TRUE), sd = sd(population$value, na.rm = TRUE))`, na.rm = TRUE))

## [1] 0.3332498

with(population, skewness(value, na.rm = TRUE))

## [1] 2.510988
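
For intuition about what the coefficient measures, the sketch below computes a moment-based skewness in base R: the third central moment divided by the second central moment raised to 3/2. `moment_skewness` is a hypothetical helper; `e1071::skewness()` offers several variants through its `type` argument, so its output can differ slightly from this plain version.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3 / 2)
}

moment_skewness(c(1, 2, 3, 4, 5))  # symmetric data: skewness of zero

# Six state values from the earlier table: one very large value (California)
# stretches the right tail, so the coefficient is positive
value <- c(6252633, 2836987, 36434140, 4913318, 3455945, 561702)
moment_skewness(value)
```

A positive coefficient, as obtained for the real data above, corresponds to a long right tail: a few very populous states far above the bulk of the distribution.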

Kurtosis of the population data

with(ingresos.medianos, kurtosis(`rnorm(n = length(population$value), mean = mean(population$value, na.rm = TRUE), sd = sd(population$value, na.rm = TRUE))`, na.rm = TRUE))

## [1] -0.6490457

with(population, kurtosis(value, na.rm = TRUE))

## [1] 7.513754
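
The kurtosis values above are excess kurtosis, for which a normal distribution scores about zero. A base-R sketch of the moment-based version follows: the fourth central moment over the squared second central moment, minus 3. `excess_kurtosis` is a hypothetical helper, and `e1071::kurtosis()` may apply a slightly different correction depending on its `type` argument.

```r
# Hypothetical helper: moment-based excess kurtosis, m4 / m2^2 - 3
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m4 <- mean(d^4)   # fourth central moment
  m4 / m2^2 - 3
}

# Evenly spread data have lighter tails than a normal: negative excess kurtosis
excess_kurtosis(c(1, 2, 3, 4, 5))
```

A large positive value, as obtained for the real data, points to heavy tails, meaning a few states lie far from the mean.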

Elízabeth Pulido and Sara Gómez

2022-09-05