Aula 02 — Limpeza de Dados & Análise Univariada

MBA Data Science · Análise Exploratória de Dados com R

Professor(a): Edneide Ramalho

18/03/2026

🎯 Objetivos da aula

Ao final desta aula, o aluno será capaz de:

  • Importar e inspecionar um dataset real com problemas típicos de qualidade
  • Identificar e tratar valores ausentes (NAs), outliers e tipos incorretos de variáveis
  • Descrever variáveis com medidas de tendência central e dispersão
  • Visualizar distribuições com histogramas, boxplots e density plots usando ggplot2
  • Reconhecer que limpeza e análise univariada são um ciclo, não etapas separadas

Dataset: Ames Housing — preços de imóveis em Ames, Iowa (EUA)
Pacotes principais: tidyverse, janitor, skimr, naniar, visdat


1 Contexto do Problema

O dataset Ames Housing contém informações sobre 2.930 imóveis vendidos em Ames, Iowa, entre 2006 e 2010. Foi compilado por Dean De Cock e é amplamente usado como alternativa ao Boston Housing por ter características muito mais ricas e problemas reais de qualidade.

Nossa tarefa: entender a estrutura do dado, limpá-lo e descrever as principais variáveis antes de qualquer modelagem.

# Baixe o dataset em: https://www.kaggle.com/datasets/prevek18/ames-housing-dataset
# Salve como "ames.csv" na sua pasta de trabalho
library(readr)
ames_raw <- read_csv("dados/AmesHousing.csv") |> 
  clean_names()

glimpse(ames_raw)
Rows: 2,930
Columns: 82
$ order           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ pid             <chr> "0526301100", "0526350040", "0526351010", "0526353030"…
$ ms_sub_class    <chr> "020", "020", "020", "020", "060", "060", "120", "120"…
$ ms_zoning       <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL", "RL", …
$ lot_frontage    <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, NA, 63, 8…
$ lot_area        <dbl> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5…
$ street          <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"…
$ alley           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ lot_shape       <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg", "IR1"…
$ land_contour    <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "HLS"…
$ utilities       <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "All…
$ lot_config      <chr> "Corner", "Inside", "Corner", "Corner", "Inside", "Ins…
$ land_slope      <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl"…
$ neighborhood    <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert", "Gilber…
$ condition_1     <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm…
$ condition_2     <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"…
$ bldg_type       <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "Twnhs…
$ house_style     <chr> "1Story", "1Story", "1Story", "1Story", "2Story", "2St…
$ overall_qual    <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, …
$ overall_cond    <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, …
$ year_built      <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ year_remod_add  <dbl> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996, …
$ roof_style      <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable", "Gable…
$ roof_matl       <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg",…
$ exterior_1st    <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
$ exterior_2nd    <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
$ mas_vnr_type    <chr> "Stone", "None", "BrkFace", "None", "None", "BrkFace",…
$ mas_vnr_area    <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603,…
$ exter_qual      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd", "Gd", …
$ exter_cond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ foundation      <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc", "PCon…
$ bsmt_qual       <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd", "Gd", …
$ bsmt_cond       <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ bsmt_exposure   <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No", "No", …
$ bsmt_fin_type_1 <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ"…
$ bsmt_fin_sf_1   <dbl> 639, 468, 923, 1065, 791, 602, 616, 263, 1180, 0, 0, 9…
$ bsmt_fin_type_2 <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf"…
$ bsmt_fin_sf_2   <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0…
$ bsmt_unf_sf     <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994, 76…
$ total_bsmt_sf   <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994…
$ heating         <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"…
$ heating_qc      <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", …
$ central_air     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ electrical      <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", …
$ x1st_flr_sf     <dbl> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 102…
$ x2nd_flr_sf     <dbl> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0, 0,…
$ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gr_liv_area     <dbl> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1…
$ bsmt_full_bath  <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
$ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ full_bath       <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1, …
$ half_bath       <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, …
$ bedroom_abv_gr  <dbl> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, …
$ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ kitchen_qual    <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd", "Gd", …
$ tot_rms_abv_grd <dbl> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8, 8,…
$ functional      <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ"…
$ fireplaces      <dbl> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, …
$ fireplace_qu    <chr> "Gd", NA, NA, "TA", "TA", "Gd", NA, NA, "TA", "TA", "T…
$ garage_type     <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "Att…
$ garage_yr_blt   <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ garage_finish   <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "Fin", "Fin", "RFn"…
$ garage_cars     <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
$ garage_area     <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 440,…
$ garage_qual     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ garage_cond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
$ paved_drive     <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ wood_deck_sf    <dbl> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 483, …
$ open_porch_sf   <dbl> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0, 5…
$ enclosed_porch  <dbl> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ x3ssn_porch     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ screen_porch    <dbl> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, 210…
$ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pool_qc         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fence           <chr> NA, "MnPrv", NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, …
$ misc_feature    <chr> NA, NA, "Gar2", NA, NA, NA, NA, NA, NA, NA, NA, "Shed"…
$ misc_val        <dbl> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, …
$ mo_sold         <dbl> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, 6, …
$ yr_sold         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
$ sale_type       <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", …
$ sale_condition  <chr> "Normal", "Normal", "Normal", "Normal", "Normal", "Nor…
$ sale_price      <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500…
💡 clean_names() do pacote janitor

Converte automaticamente nomes de colunas como "Lot.Area" ou "1stFlrSF" para lot_area e x1st_flr_sf. Sempre use no início do pipeline — evita dores de cabeça com espaços e caracteres especiais.


2 Inspeção Inicial

Antes de qualquer limpeza, precisamos entender o que temos. As perguntas iniciais são sempre as mesmas:

  • Quantas linhas e colunas?
  • Quais tipos de variáveis?
  • Há valores ausentes? Onde e quantos?
  • As variáveis fazem sentido para o problema?

2.1 Dimensões e tipos

cat("Linhas:", nrow(ames_raw), "\n")
Linhas: 2930 
cat("Colunas:", ncol(ames_raw), "\n")
Colunas: 82 
cat("Células totais:", nrow(ames_raw) * ncol(ames_raw), "\n")
Células totais: 240260 
# Contagem por tipo de variável
ames_raw |>
  summarise(across(everything(), class)) |>
  pivot_longer(everything(), names_to = "variavel", values_to = "tipo") |>
  count(tipo, sort = TRUE) |>
  kbl(col.names = c("Tipo", "Quantidade"), align = "lr") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Tipo Quantidade
character 45
numeric 37

2.2 Visão geral com skim()

skim(ames_raw)
Data summary
Name ames_raw
Number of rows 2930
Number of columns 82
_______________________
Column type frequency:
character 45
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
pid 0 1.00 10 10 0 2930 0
ms_sub_class 0 1.00 3 3 0 16 0
ms_zoning 0 1.00 2 7 0 7 0
street 0 1.00 4 4 0 2 0
alley 2732 0.07 4 4 0 2 0
lot_shape 0 1.00 3 3 0 4 0
land_contour 0 1.00 3 3 0 4 0
utilities 0 1.00 6 6 0 3 0
lot_config 0 1.00 3 7 0 5 0
land_slope 0 1.00 3 3 0 3 0
neighborhood 0 1.00 5 7 0 28 0
condition_1 0 1.00 4 6 0 9 0
condition_2 0 1.00 4 6 0 8 0
bldg_type 0 1.00 4 6 0 5 0
house_style 0 1.00 4 6 0 8 0
roof_style 0 1.00 3 7 0 6 0
roof_matl 0 1.00 4 7 0 8 0
exterior_1st 0 1.00 5 7 0 16 0
exterior_2nd 0 1.00 5 7 0 17 0
mas_vnr_type 23 0.99 4 7 0 5 0
exter_qual 0 1.00 2 2 0 4 0
exter_cond 0 1.00 2 2 0 5 0
foundation 0 1.00 4 6 0 6 0
bsmt_qual 80 0.97 2 2 0 5 0
bsmt_cond 80 0.97 2 2 0 5 0
bsmt_exposure 83 0.97 2 2 0 4 0
bsmt_fin_type_1 80 0.97 3 3 0 6 0
bsmt_fin_type_2 81 0.97 3 3 0 6 0
heating 0 1.00 4 5 0 6 0
heating_qc 0 1.00 2 2 0 5 0
central_air 0 1.00 1 1 0 2 0
electrical 1 1.00 3 5 0 5 0
kitchen_qual 0 1.00 2 2 0 5 0
functional 0 1.00 3 4 0 8 0
fireplace_qu 1422 0.51 2 2 0 5 0
garage_type 157 0.95 6 7 0 6 0
garage_finish 159 0.95 3 3 0 3 0
garage_qual 159 0.95 2 2 0 5 0
garage_cond 159 0.95 2 2 0 5 0
paved_drive 0 1.00 1 1 0 3 0
pool_qc 2917 0.00 2 2 0 4 0
fence 2358 0.20 4 5 0 4 0
misc_feature 2824 0.04 4 4 0 5 0
sale_type 0 1.00 2 5 0 10 0
sale_condition 0 1.00 6 7 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
order 0 1.00 1465.50 845.96 1 733.25 1465.5 2197.75 2930 ▇▇▇▇▇
lot_frontage 490 0.83 69.22 23.37 21 58.00 68.0 80.00 313 ▇▃▁▁▁
lot_area 0 1.00 10147.92 7880.02 1300 7440.25 9436.5 11555.25 215245 ▇▁▁▁▁
overall_qual 0 1.00 6.09 1.41 1 5.00 6.0 7.00 10 ▁▂▇▅▁
overall_cond 0 1.00 5.56 1.11 1 5.00 5.0 6.00 9 ▁▁▇▅▁
year_built 0 1.00 1971.36 30.25 1872 1954.00 1973.0 2001.00 2010 ▁▂▃▆▇
year_remod_add 0 1.00 1984.27 20.86 1950 1965.00 1993.0 2004.00 2010 ▅▂▂▃▇
mas_vnr_area 23 0.99 101.90 179.11 0 0.00 0.0 164.00 1600 ▇▁▁▁▁
bsmt_fin_sf_1 1 1.00 442.63 455.59 0 0.00 370.0 734.00 5644 ▇▁▁▁▁
bsmt_fin_sf_2 1 1.00 49.72 169.17 0 0.00 0.0 0.00 1526 ▇▁▁▁▁
bsmt_unf_sf 1 1.00 559.26 439.49 0 219.00 466.0 802.00 2336 ▇▅▂▁▁
total_bsmt_sf 1 1.00 1051.61 440.62 0 793.00 990.0 1302.00 6110 ▇▃▁▁▁
x1st_flr_sf 0 1.00 1159.56 391.89 334 876.25 1084.0 1384.00 5095 ▇▃▁▁▁
x2nd_flr_sf 0 1.00 335.46 428.40 0 0.00 0.0 703.75 2065 ▇▃▂▁▁
low_qual_fin_sf 0 1.00 4.68 46.31 0 0.00 0.0 0.00 1064 ▇▁▁▁▁
gr_liv_area 0 1.00 1499.69 505.51 334 1126.00 1442.0 1742.75 5642 ▇▇▁▁▁
bsmt_full_bath 2 1.00 0.43 0.52 0 0.00 0.0 1.00 3 ▇▆▁▁▁
bsmt_half_bath 2 1.00 0.06 0.25 0 0.00 0.0 0.00 2 ▇▁▁▁▁
full_bath 0 1.00 1.57 0.55 0 1.00 2.0 2.00 4 ▁▇▇▁▁
half_bath 0 1.00 0.38 0.50 0 0.00 0.0 1.00 2 ▇▁▅▁▁
bedroom_abv_gr 0 1.00 2.85 0.83 0 2.00 3.0 3.00 8 ▁▇▂▁▁
kitchen_abv_gr 0 1.00 1.04 0.21 0 1.00 1.0 1.00 3 ▁▇▁▁▁
tot_rms_abv_grd 0 1.00 6.44 1.57 2 5.00 6.0 7.00 15 ▁▇▂▁▁
fireplaces 0 1.00 0.60 0.65 0 0.00 1.0 1.00 4 ▇▇▁▁▁
garage_yr_blt 159 0.95 1978.13 25.53 1895 1960.00 1979.0 2002.00 2207 ▂▇▁▁▁
garage_cars 1 1.00 1.77 0.76 0 1.00 2.0 2.00 5 ▅▇▂▁▁
garage_area 1 1.00 472.82 215.05 0 320.00 480.0 576.00 1488 ▃▇▃▁▁
wood_deck_sf 0 1.00 93.75 126.36 0 0.00 0.0 168.00 1424 ▇▁▁▁▁
open_porch_sf 0 1.00 47.53 67.48 0 0.00 27.0 70.00 742 ▇▁▁▁▁
enclosed_porch 0 1.00 23.01 64.14 0 0.00 0.0 0.00 1012 ▇▁▁▁▁
x3ssn_porch 0 1.00 2.59 25.14 0 0.00 0.0 0.00 508 ▇▁▁▁▁
screen_porch 0 1.00 16.00 56.09 0 0.00 0.0 0.00 576 ▇▁▁▁▁
pool_area 0 1.00 2.24 35.60 0 0.00 0.0 0.00 800 ▇▁▁▁▁
misc_val 0 1.00 50.64 566.34 0 0.00 0.0 0.00 17000 ▇▁▁▁▁
mo_sold 0 1.00 6.22 2.71 1 4.00 6.0 8.00 12 ▅▆▇▃▃
yr_sold 0 1.00 2007.79 1.32 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▃
sale_price 0 1.00 180796.06 79886.69 12789 129500.00 160000.0 213500.00 755000 ▇▇▁▁▁
📊 O que ler no skim()
  • n_missing: contagem absoluta de NAs
  • complete_rate: proporção de valores presentes (1.0 = sem NAs)
  • p0, p25, p50, p75, p100: quartis — dão a forma da distribuição
  • hist: mini-histograma inline — rápido para detectar assimetria

3 Valores Ausentes (NAs)

3.1 Mapa visual de NAs

vis_miss(ames_raw, sort_miss = TRUE) +
  labs(title = "Valores ausentes no Ames Housing",
       subtitle = "Ordenado por proporção de missings") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))

3.2 Proporção de NAs por variável

na_summary <- ames_raw |>
  summarise(across(everything(), ~ mean(is.na(.)))) |>
  pivot_longer(everything(), names_to = "variavel", 
               values_to = "pct_na") |>
  filter(pct_na > 0) |>
  arrange(desc(pct_na)) |>
  mutate(pct_na_fmt = percent(pct_na, accuracy = 0.1))

na_summary |>
  kbl(col.names = c("Variável", "% NA", "% Formatado"),
      align = "lrr",
      caption = "Variáveis com valores ausentes") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(which(na_summary$pct_na > 0.4), background = "#FAECE7")
Variáveis com valores ausentes
Variável % NA % Formatado
pool_qc 0.9955631 99.6%
misc_feature 0.9638225 96.4%
alley 0.9324232 93.2%
fence 0.8047782 80.5%
fireplace_qu 0.4853242 48.5%
lot_frontage 0.1672355 16.7%
garage_yr_blt 0.0542662 5.4%
garage_finish 0.0542662 5.4%
garage_qual 0.0542662 5.4%
garage_cond 0.0542662 5.4%
garage_type 0.0535836 5.4%
bsmt_exposure 0.0283276 2.8%
bsmt_fin_type_2 0.0276451 2.8%
bsmt_qual 0.0273038 2.7%
bsmt_cond 0.0273038 2.7%
bsmt_fin_type_1 0.0273038 2.7%
mas_vnr_type 0.0078498 0.8%
mas_vnr_area 0.0078498 0.8%
bsmt_full_bath 0.0006826 0.1%
bsmt_half_bath 0.0006826 0.1%
bsmt_fin_sf_1 0.0003413 0.0%
bsmt_fin_sf_2 0.0003413 0.0%
bsmt_unf_sf 0.0003413 0.0%
total_bsmt_sf 0.0003413 0.0%
electrical 0.0003413 0.0%
garage_cars 0.0003413 0.0%
garage_area 0.0003413 0.0%
na_summary |>
  slice_max(pct_na, n = 15) |>
  ggplot(aes(x = reorder(variavel, pct_na), y = pct_na)) +
  geom_col(fill = "#C04828", alpha = 0.85, width = 0.7) +
  geom_text(aes(label = pct_na_fmt), hjust = -0.15, size = 3.2,
            color = "#73726C") +
  coord_flip() +
  scale_y_continuous(labels = percent_format(), limits = c(0, 1.1),
                     expand = c(0, 0)) +
  labs(title = "Proporção de valores ausentes",
       subtitle = "Top 15 variáveis com mais NAs",
       x = NULL, y = "% de NAs") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        panel.grid.major.y = element_blank())

⚠️ NA nem sempre significa “dado faltante”

No Ames Housing, variáveis como pool_qc, misc_feature e alley têm NAs que semanticamente significam “ausência” (imóvel sem piscina, sem ruela, etc.), não dado faltante. Sempre leia o dicionário de dados antes de imputar!

  • pool_qc: NA = sem piscina
  • alley: NA = sem acesso por ruela
  • fence: NA = sem cerca
  • misc_feature: NA = sem característica especial

3.3 Tratamento de NAs

ames_clean <- ames_raw |>

  # 1. NAs semânticos → categoria "Nenhum"
  mutate(across(
    c(pool_qc, misc_feature, alley, fence, fireplace_qu,
      garage_type, garage_finish, garage_qual, garage_cond,
      bsmt_qual, bsmt_cond, bsmt_exposure, bsmt_fin_type_1, bsmt_fin_type_2,
      mas_vnr_type),
    ~ replace_na(., "None")
  )) |>

  # 2. NAs numéricos semânticos → 0
  mutate(across(
    c(garage_yr_blt, garage_area, garage_cars,
      bsmt_fin_sf_1, bsmt_fin_sf_2, 
      bsmt_unf_sf, total_bsmt_sf,
      bsmt_full_bath, bsmt_half_bath, mas_vnr_area),
    ~ replace_na(., 0)
  )) |>

  # 3. Electrical: único NA real → imputar moda
  mutate(electrical = replace_na(electrical,
    names(sort(table(electrical), decreasing = TRUE))[1])) |>

  # 4. Lot frontage: imputar mediana por bairro
  group_by(neighborhood) |>
  mutate(lot_frontage = if_else(
    is.na(lot_frontage),
    median(lot_frontage, na.rm = TRUE),
    lot_frontage
  )) |>
  ungroup()

# Verificar: NAs restantes
remaining_na <- sum(is.na(ames_clean))
cat("NAs restantes após limpeza:", remaining_na, "\n")
NAs restantes após limpeza: 3 

4 Tipagem de Variáveis

4.1 Variáveis numéricas que deveriam ser categóricas

# Olhando as variáveis numéricas 
ames_clean |> 
  select(where(is.numeric)) |> 
  glimpse()
Rows: 2,930
Columns: 37
$ order           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ lot_frontage    <dbl> 141.0, 80.0, 81.0, 93.0, 74.0, 78.0, 41.0, 43.0, 39.0,…
$ lot_area        <dbl> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5…
$ overall_qual    <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, …
$ overall_cond    <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, …
$ year_built      <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ year_remod_add  <dbl> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996, …
$ mas_vnr_area    <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603,…
$ bsmt_fin_sf_1   <dbl> 639, 468, 923, 1065, 791, 602, 616, 263, 1180, 0, 0, 9…
$ bsmt_fin_sf_2   <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0…
$ bsmt_unf_sf     <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994, 76…
$ total_bsmt_sf   <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994…
$ x1st_flr_sf     <dbl> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 102…
$ x2nd_flr_sf     <dbl> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0, 0,…
$ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gr_liv_area     <dbl> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1…
$ bsmt_full_bath  <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
$ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ full_bath       <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1, …
$ half_bath       <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, …
$ bedroom_abv_gr  <dbl> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, …
$ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ tot_rms_abv_grd <dbl> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8, 8,…
$ fireplaces      <dbl> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, …
$ garage_yr_blt   <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
$ garage_cars     <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
$ garage_area     <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 440,…
$ wood_deck_sf    <dbl> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 483, …
$ open_porch_sf   <dbl> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0, 5…
$ enclosed_porch  <dbl> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ x3ssn_porch     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ screen_porch    <dbl> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, 210…
$ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ misc_val        <dbl> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, …
$ mo_sold         <dbl> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, 6, …
$ yr_sold         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
$ sale_price      <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500…
# Variáveis codificadas como número mas que são categorias ordinais
ames_clean <- ames_clean |>
  mutate(
    # Mês e ano de venda → factor
    mo_sold   = factor(mo_sold, levels = 1:12,
                       labels = month.abb),
    yr_sold   = factor(yr_sold),

    # Qualidade geral e condição geral (escala 1–10) → ordered factor
    overall_qual = factor(overall_qual, levels = 1:10, ordered = TRUE),
    overall_cond = factor(overall_cond, levels = 1:10, ordered = TRUE),

    # MS SubClass: código numérico para tipo de imóvel → factor
    ms_sub_class = factor(ms_sub_class)
  )

# Confirmar tipos
ames_clean |>
  select(mo_sold, yr_sold, overall_qual, overall_cond, ms_sub_class) |>
  glimpse()
Rows: 2,930
Columns: 5
$ mo_sold      <fct> May, Jun, Jun, Apr, Mar, Jun, Apr, Jan, Mar, Jun, Apr, Ma…
$ yr_sold      <fct> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
$ overall_qual <ord> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, 4, …
$ overall_cond <ord> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, 5, …
$ ms_sub_class <fct> 020, 020, 020, 020, 060, 060, 120, 120, 120, 060, 060, 02…
💡 Ordered factors no R

Para variáveis ordinais (ex: qualidade de 1 a 10), use factor(..., ordered = TRUE). Isso permite comparações lógicas (overall_qual > 7) e modelos lineares tratam corretamente os contrastes.

4.2 Detecção de outliers

# Boxplot para detectar outliers no preço
p1 <- ames_clean |>
  ggplot(aes(y = sale_price)) +
  geom_boxplot(fill = "#E6F1FB", color = "#185FA5",
               outlier.color = "#C04828", 
               outlier.alpha = 0.6,
               width = 0.4) +
  scale_y_continuous(labels = dollar_format(prefix = "US$", big.mark = ".")) +
  labs(title = "Boxplot — Sale Price", y = "Preço de Venda") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank(),
        plot.title = element_text(face = "bold"))
# Identificar outliers pelo critério IQR
q1  <- quantile(ames_clean$sale_price, 0.25)
q3  <- quantile(ames_clean$sale_price, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers_preco <- ames_clean |>
  filter(sale_price < lower | sale_price > upper) |>
  select(sale_price, gr_liv_area, overall_qual, neighborhood) |>
  arrange(desc(sale_price))
cat("Outliers pelo critério IQR:", nrow(outliers_preco), "imóveis\n")
Outliers pelo critério IQR: 137 imóveis
cat("Limite inferior:", dollar(lower), "\n")
Limite inferior: $3,500 
cat("Limite superior:", dollar(upper), "\n")
Limite superior: $339,500 
p1

# Outliers famosos do Ames Housing: área grande, preço baixo
ames_clean |>
  ggplot(aes(x = gr_liv_area, y = sale_price)) +
  geom_point(alpha = 0.3, color = "#185FA5", size = 1.5) +
  geom_point(data = filter(ames_clean, gr_liv_area > 4000), color = "#C04828", size = 3.5, shape = 18) +
  geom_text(data = filter(ames_clean, gr_liv_area > 4000),
            aes(label = paste0(gr_liv_area, " sqft\n", dollar(sale_price))),
            hjust = -0.1, size = 3, color = "#C04828") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = dollar_format(prefix = "US$", big.mark = ".")) +
  labs(title = "Área habitável vs. Preço de Venda",
       subtitle = "Losangos vermelhos: possíveis outliers influentes",
       x = "Área habitável (sqft)", y = "Preço de Venda") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

⚠️ Outliers: remover ou manter?

Os dois imóveis com área > 4.000 sqft e preço baixo são vendas atípicas (provavelmente venda em distress ou transação familiar). Em competições Kaggle com este dataset, é comum removê-los. Em análises reais, documente a decisão — nunca remova silenciosamente.

# Remover outliers de área conhecidos (documentado explicitamente)
ames_clean <- ames_clean |>
  filter(gr_liv_area <= 4000)

cat("Linhas após remoção de outliers de área:", nrow(ames_clean), "\n")
Linhas após remoção de outliers de área: 2925 

5 Análise Univariada

5.1 Variável resposta: sale_price

ames_clean |>
  summarise(
    n         = n(),
    media     = mean(sale_price),
    mediana   = median(sale_price),
    dp        = sd(sale_price),
    cv        = dp / media,
    min       = min(sale_price),
    max       = max(sale_price),
    assimetria = (mean(sale_price) - median(sale_price)) / sd(sale_price)
  ) |> 
  mutate(
    across(c(media, mediana, dp, min, max),
           ~ dollar(., prefix = "US$")),
    across(c(cv, assimetria), ~ round(., 3))
    ) |> 
  mutate(
    n = as.character(n),
    cv = as.character(cv),
    assimetria = as.character(assimetria)
  ) |> 
  pivot_longer(
    everything(), 
    names_to = "Estatística",
    values_to = "Valor") |> 
  kbl(align = "lr") |> 
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Estatística Valor
n 2925
media US$180,412
mediana US$160,000
dp US$78,554.86
cv 0.435
min US$12,789
max US$625,000
assimetria 0.26
p_price <- ames_clean |>
  ggplot(aes(x = sale_price)) +
  geom_histogram(bins = 50, fill = "#185FA5", alpha = 0.8, color = "white") +
  geom_vline(aes(xintercept = mean(sale_price)),
             color = "#C04828", linewidth = 1, linetype = "dashed") +
  geom_vline(aes(xintercept = median(sale_price)),
             color = "#3B6D11", linewidth = 1, linetype = "solid") +
  scale_x_continuous(labels = dollar_format(prefix = "US$", scale = 1e-3, suffix = "k")) +
  annotate("text", x = mean(ames_clean$sale_price) * 1.05,
           y = Inf, vjust = 2, size = 3,
           label = "Média", color = "#C04828") +
  annotate("text", x = median(ames_clean$sale_price) * 0.92,
           y = Inf, vjust = 2, size = 3,
           label = "Mediana", color = "#3B6D11") +
  labs(title = "Sale Price — Original",
       subtitle = "Distribuição assimétrica à direita",
       x = "Preço de Venda", y = "Frequência") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))


p_price

📐 Média vs. Mediana como diagnóstico de assimetria

Quando média > mediana, a distribuição é assimétrica à direita (cauda longa para valores altos). No sale_price:

  • Média: ~US$ 180k
  • Mediana: ~US$ 163k

A diferença indica que imóveis caros puxam a média para cima.

5.2 Variáveis numéricas contínuas

vars_continuas <- c("gr_liv_area", "lot_area", "total_bsmt_sf",
                    "x1st_flr_sf", "garage_area", "year_built")

plots_cont <- map(vars_continuas, function(var) {
  ames_clean |>
    ggplot(aes(x = .data[[var]])) +
    geom_histogram(bins = 40, fill = "#534AB7", alpha = 0.75, color = "white") +
    scale_x_continuous(labels = comma) +
    labs(title = var, x = NULL, y = NULL) +
    theme_minimal(base_size = 9) +
    theme(plot.title = element_text(face = "bold", size = 9))
})

wrap_plots(plots_cont, ncol = 3) +
  plot_annotation(
    title = "Distribuições — variáveis contínuas",
    subtitle = "Nota: escalas dos eixos X variam entre painéis",
    theme = theme(plot.title = element_text(face = "bold", size = 13))
  )

5.3 Variáveis categóricas

vars_cat <- c("neighborhood", "house_style", "bldg_type",
              "sale_condition", "exter_qual")

plots_cat <- map(vars_cat, function(var) {
  ames_clean |>
    count(.data[[var]], sort = TRUE) |>
    slice_max(n, n = 10) |>
    ggplot(aes(x = reorder(.data[[var]], n), y = n)) +
    geom_col(fill = "#854F0B", alpha = 0.8, width = 0.7) +
    coord_flip() +
    labs(title = var, x = NULL, y = NULL) +
    theme_minimal(base_size = 8) +
    theme(plot.title = element_text(face = "bold", size = 8))
})

wrap_plots(plots_cat, ncol = 2) +
  plot_annotation(
    title = "Frequência — variáveis categóricas (top 10 por variável)",
    theme = theme(plot.title = element_text(face = "bold", size = 13))
  )

5.4 Qualidade geral (overall_qual)

ames_clean |>
  count(overall_qual) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = overall_qual, y = n)) +
  geom_col(aes(fill = as.integer(overall_qual)),
           alpha = 0.85, width = 0.75, show.legend = FALSE) +
  geom_text(aes(label = percent(pct, accuracy = 0.1)),
            vjust = -0.4, size = 3.2, color = "#73726C") +
  scale_fill_gradient(low = "#B5D4F4", high = "#0C447C") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Distribuição da Qualidade Geral",
       subtitle = "Escala de 1 (muito ruim) a 10 (excelente)",
       x = "Qualidade Geral", y = "Contagem") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))


6 Tabela Resumo Final

ames_clean |>
  select(sale_price, gr_liv_area, lot_area,
         total_bsmt_sf, year_built, garage_area) |>
  pivot_longer(everything(), names_to = "variavel", values_to = "valor") |>
  group_by(variavel) |>
  summarise(
    n       = n(),
    media   = round(mean(valor, na.rm = TRUE), 1),
    mediana = round(median(valor, na.rm = TRUE), 1),
    dp      = round(sd(valor, na.rm = TRUE), 1),
    min     = round(min(valor, na.rm = TRUE), 1),
    max     = round(max(valor, na.rm = TRUE), 1),
    na_pct  = percent(mean(is.na(valor)), accuracy = 0.1)
  ) |>
  kbl(col.names = c("Variável", "n", "Média", "Mediana",
                     "DP", "Mín", "Máx", "% NA"),
      align = "lrrrrrrc",
      caption = "Estatísticas descritivas — variáveis numéricas selecionadas") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE)
Estatísticas descritivas — variáveis numéricas selecionadas
Variável n Média Mediana DP Mín Máx % NA
garage_area 2925 471.7 480 214.0 0 1488 0.0%
gr_liv_area 2925 1494.0 1441 486.3 334 3820 0.0%
lot_area 2925 10103.6 9428 7782.0 1300 215245 0.0%
sale_price 2925 180411.6 160000 78554.9 12789 625000 0.0%
total_bsmt_sf 2925 1046.5 989 421.5 0 3206 0.0%
year_built 2925 1971.3 1973 30.2 1872 2010 0.0%

7 Exercícios Práticos

📝 Exercício 1 — Inspecionar NAs em lot_frontage

Tarefa: Calcule a proporção de NAs em lot_frontage no dataset original (ames_raw). Em seguida, verifique se a proporção de NAs varia entre bairros (neighborhood). Qual bairro tem a maior proporção de NAs nessa variável?

# # Proporção geral de NAs em lot_frontage
# ames_raw |>
#   summarise(pct_na = _____(is.na(lot_frontage)))
# 
# # Proporção por bairro
# ames_raw |>
#   group_by(_____) |>
#   summarise(pct_na = mean(is.na(lot_frontage)),
#             n      = n()) |>
#   arrange(desc(pct_na))

📝 Exercício 2 — Distribuição de year_built

Tarefa: Crie um histograma da variável year_built usando ggplot2. Adicione uma linha vertical indicando a mediana. Interprete: a distribuição é assimétrica? Para qual lado?

# ames_clean |>
#   ggplot(aes(x = _____)) +
#   geom_histogram(bins = _____, fill = "#534AB7", alpha = 0.8, color = "white") +
#   geom_vline(xintercept = _____(ames_clean$year_built),
#              color = "#C04828", linewidth = 1.2, linetype = "dashed") +
#   labs(title = "Distribuição do Ano de Construção",
#        x = "Ano de Construção", y = "Frequência") +
#   theme_minimal()

📝 Exercício 3 — Preço médio por tipo de construção

Tarefa: Calcule a mediana do sale_price para cada categoria de bldg_type. Ordene do maior para o menor e crie um gráfico de barras horizontais. Qual tipo de imóvel tem a maior mediana de preço?

# ames_clean |>
#   group_by(_____) |>
#   summarise(mediana_preco = _____(sale_price)) |>
#   ggplot(aes(x = reorder(_____, mediana_preco), y = mediana_preco)) +
#   geom_col(fill = "#0F6E56", alpha = 0.85, width = 0.6) +
#   coord_flip() +
#   scale_y_continuous(labels = dollar_format(prefix = "US$", big.mark = ".")) +
#   labs(title = "Mediana do Preço por Tipo de Imóvel",
#        x = NULL, y = "Mediana do Preço de Venda") +
#   theme_minimal()

8 Síntese da Aula

🔁 Ciclo Limpeza → Univariada → Decisão

Esta aula ilustrou um princípio central da EDA: você não limpa os dados e depois os analisa — você os limpa e analisa ao mesmo tempo, em ciclos.

Próxima aula: com o dado limpo, avançamos para a análise bivariada — entender como as variáveis se relacionam entre si e com o preço de venda.