How to join tables in the data lake
We organize the data so that joining tables from different institutions and themes is as simple as any other query. For this, we defined a standard methodology for data treatment, column naming, tables, and datasets.
How does the DB methodology work?
Information from different tables can be aggregated through identifier keys. An identifier key is a column whose name is unique across all tables in the data lake and is used to identify an entity.
Example of an identifier key
The year
column has the same name in all data lake tables - it always refers
to the variable that has any years from our calendar as its value.
When working with IBGE population data, the year
column, along with the
municipality
column, uniquely identify each row in the table:
-
There is no more than one row with the same year and municipality;
-
There is no row with null values for
year
ormunicipality
in the table;
Test it yourself: the queries below should return empty!
library("basedosdados")
# Search for any row that has repeated year and municipality
query <- "SELECT ano, municipio, count(*) as total
FROM `basedosdados.br_ibge_populacao.municipio`
GROUP BY ano, municipio
WHERE total > 1"
read_sql(query=query)
# Search for rows with null year or municipality
query <- "SELECT * FROM
`basedosdados.br_ibge_populacao.municipio`
WHERE ano IS NULL OR municipio IS NULL"
read_sql(query=query)
import basedadosdados as bd
# Search for any row that has repeated year and municipality
query = """SELECT ano, municipio, count(*) as total
FROM `basedosdados.br_ibge_populacao.municipio`
GROUP BY ano, municipio
WHERE total > 1"""
bd.read_sql(query=query)
# Search for rows with null year or municipality
query = """SELECT * FROM
`basedosdados.br_ibge_populacao.municipio`
WHERE ano IS NULL OR municipio IS NULL"""
bd.read_sql(query=query)
...
Crossing tables with identifier keys
The indication of a set of columns as an identifier key is made directly in the table metadata. Thus, you can know which tables can be crossed by comparing the set of identifier keys of each one.
Below we'll make an example of how to cross IBGE's population and GDP tables to obtain the GDP per capita of all Brazilian municipalities.
In the population and GDP tables, the year
and municipality
columns are
identifier keys. Therefore, we'll use these columns in our JOIN
function to
determine how to cross the tables.
library("basedosdados")
set_billing_id("<YOUR_PROJECT_ID>")
query <- "SELECT
pib.id_municipio,
pop.ano,
pib.PIB / pop.populacao as pib_per_capita
FROM `basedosdados.br_ibge_pib.municipio` as pib
JOIN `basedosdados.br_ibge_populacao.municipio` as pop
ON pib.id_municipio = pop.id_municipio AND pib.ano = pop.ano"
# You can download to your computer
dir <- tempdir()
data <- download(query, file.path(dir, "gdp_per_capita.csv"))
# Or load the query result into your analysis environment
data <- read_sql(query)
import basedosdados as bd
gdp_per_capita = """SELECT
pib.id_municipio ,
pop.ano,
pib.PIB / pop.populacao as pib_per_capita
FROM `basedosdados.br_ibge_pib.municipio` as pib
INNER JOIN `basedosdados.br_ibge_populacao.municipio` as pop
ON pib.id_municipio = pop.id_municipio AND pib.ano = pop.ano
"""
# You can download to your computer
bd.download(query=gdp_per_capita,
savepath="where/to/save/file",
billing_project_id=<YOUR_PROJECT_ID>)
# Or load the query result into pandas
df = bd.read_sql(gdp_per_capita, billing_project_id=<YOUR_PROJECT_ID>)
List of identifier keys
Geographic keys
-
Census tract:
id_setor_censitario
-
Municipality:
id_municipio
(standard),id_municipio_6
,id_municipio_tse
,id_municipio_rf
,id_municipio_bcb
-
Minimum Comparable Area:
id_AMC
-
Immediate region:
id_regiao_imediata
-
Intermediate region:
id_regiao_intermediaria
-
Microregion:
id_microrregiao
-
Mesoregion:
id_mesorregiao
-
Federal unit (State):
sigla_uf
(standard),id_uf
,uf
-
Region:
regiao
Time keys
ano
(year),semestre
(semester),mes
(month),semana
(week),dia
(day),hora
(hour)
Individual person keys
cpf
,pis
,nis
Legal entity keys
-
Company:
cnpj
-
School:
id_escola
Political keys
-
Candidate:
id_candidato_bd
-
Party:
sigla_partido
,partido