# Style Guide

In this section we list all the standards from our style guide and data guidelines that we use at Data Basis. They help us maintain high quality in the data and metadata we publish.

You can use the left menu to navigate through the different topics on this page.
## Naming datasets and tables

### Datasets (`dataset_id`)

We name datasets in the format `<organization_id>_<description>`, where `organization_id` by default follows the geographic scope of the organization that publishes the dataset:

| Scope | `organization_id` |
|---|---|
| Global | `world_<organization>` |
| Federal | `<country_code>_<organization>` |
| State | `<country_code>_<state_code>_<organization>` |
| Municipal | `<country_code>_<state_code>_<city>_<organization>` |

- `country_code` and `state_code` are always 2 lowercase letters.
- `organization` is the name or acronym (preferably) of the organization that published the original data (e.g., `ibge`, `tse`, `inep`).
- `description` is a brief description of the dataset.

For example, the GDP dataset from IBGE has `dataset_id`: `br_ibge_pib`.

Not sure how to name the organization? We suggest visiting their website and seeing how they refer to themselves (e.g., DETRAN-RJ would be `br_rj_detran`).
### Tables

Naming tables is less structured and therefore requires good judgment. But we have some rules:

- If there are tables for different entities, include the entity at the beginning of the name. Example: `municipality_value`, `state_value`.
- Do not include the time unit in the name. Example: name it `municipality`, not `municipality_year`.
- Keep names in the singular. Example: `school`, not `schools`.
- Name the most disaggregated tables `microdata`. These generally have data at the person or transaction level.
### Examples of `dataset_id.table_id`

| Scope | `dataset_id.table_id` | Description |
|---|---|---|
| Global | `world_waze.alerts` | Waze alert data from different cities. |
| Federal | `br_tse_elections.candidates` | TSE political candidate data. |
| Federal | `br_ibge_pnad.microdata` | Microdata from the National Household Sample Survey produced by IBGE. |
| Federal | `br_ibge_pnadc.microdata` | Microdata from the Continuous National Household Sample Survey (PNAD-C) produced by IBGE. |
| State | `br_sp_see_teachers.workload` | Anonymized workload of active teachers in the SP state education network. |
| Municipal | `br_rj_riodejaneiro_cmrj_legislative.votes` | Voting data from the Rio de Janeiro City Council (RJ). |
## Table formats

Tables should, whenever possible, be in `long` format rather than `wide`.
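As a minimal illustration (the table, columns, and values below are hypothetical), the same data in wide form can be converted to long form with pandas:

```python
import pandas as pd

# Hypothetical wide-format table: one GDP column per year
wide = pd.DataFrame({
    "municipality_id": ["3550308", "3304557"],
    "gdp_2019": [100, 200],
    "gdp_2020": [110, 190],
})

# Long format: one row per (municipality, year) observation
long = wide.melt(
    id_vars="municipality_id",
    var_name="year",
    value_name="gdp",
)
long["year"] = long["year"].str.removeprefix("gdp_").astype(int)
```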
## Variable naming

Variable names must follow some rules:

- Use existing names in the repository as much as possible. Examples: `year`, `month`, `municipality_id`, `state_code`, `age`, `position`, `result`, `votes`, `revenue`, `expense`, `price`, etc.
- Respect directory table patterns.
- Be as intuitive, clear, and extensive as possible.
- Use all lowercase letters (including acronyms), without accents, connected by `_`.
- Do not include connectors like `of`, `the`, `and`, `in`, etc.
- Only use the `id_` prefix when the variable represents primary keys of entities (that would eventually have a directory table).
  - Examples that have it: `municipality_id`, `state_id`, `school_id`, `person_id`.
  - Examples that don't: `network`, `location`.
  - Important: when the dataset is in English, `id` becomes a suffix.
- Only use entity suffixes when the column's entity is different from the table's entity.
  - Example that has it: in a table with entity `person`, a column about municipal GDP would be called `municipality_gdp`.
  - Examples that don't: in a table with entity `person`, person characteristics would be called `name`, `age`, `sex`, etc.
- List of allowed prefixes: `name_`, `date_`, `number_`, `quantity_`, `proportion_` (percentage variables, 0-100%), `rate_`, `ratio_`, `index_`, `indicator_` (boolean variables), `type_`, `code_`, `sequential_`.
- List of common suffixes: `_pc` (per capita).
## Measurement units

The rule is to keep variables in their original measurement units, as listed in this code, with the exception of financial variables, where we convert old currencies to the current one (e.g., Cruzeiro to Real).

We catalog measurement units in a standard format in the architecture table (complete list here). Examples: `m`, `km/h`, `BRL`.

For deflated financial columns, we list the currency with the base year. Example: a column measured in 2010 reais has unit `BRL_2010`.

Variables must always have measurement units with base 1. In other words, use `BRL` instead of `1000 BRL`, or `person` instead of `1000 persons` (a small sketch follows). This information, like other column metadata, is recorded in the table's architecture table.
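A minimal sketch of the base-1 rule, assuming a hypothetical raw column published in `1000 persons`:

```python
import pandas as pd

# Hypothetical raw data where population comes in "1000 persons"
df = pd.DataFrame({"municipality_id": ["3550308"], "population_thousands": [12325.2]})

# Store the column in base 1 ("person"), per the rule above
df["population"] = (df["population_thousands"] * 1000).round().astype("int64")
df = df.drop(columns=["population_thousands"])
```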
## Which variables to keep, add, and remove

We partially normalize our tables and have rules for which variables to include in production. They are (a small sketch follows this list):

- Remove entity-name variables that already exist in directories. Example: remove `municipality` from a table that already includes `municipality_id`.
- Remove variables serving as partitions. Example: remove `year` and `state_code` if the table is partitioned along these two dimensions.
- Add primary keys for each existing entity. Example: add `municipality_id` to tables that only include `municipality_tse`.
- Keep all primary keys that already come with the table, but (1) add relevant keys (e.g., `state_code`, `municipality_id`) and (2) remove irrelevant keys (e.g., `region`).
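A minimal sketch of these rules, using a hypothetical table partitioned by `year` and `state_code` that already carries `municipality_id`:

```python
import pandas as pd

# Hypothetical production table, before cleanup
df = pd.DataFrame({
    "municipality_id": ["3550308"],
    "municipality": ["São Paulo"],  # redundant: already in the directory
    "year": [2019],                 # redundant: used as a partition
    "state_code": ["SP"],           # redundant: used as a partition
    "gdp": [763800],
})

# Drop the directory-covered name and the partition columns
df = df.drop(columns=["municipality", "year", "state_code"])
```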
## Temporal coverage

Fill in the `temporal_coverage` column in table, column, and key metadata (in dictionaries) according to the following pattern (a small validation sketch follows this list).

- General format: `initial_date(temporal_unit)final_date`
  - `initial_date` and `final_date` are written in the corresponding temporal unit.
  - Example: a table with unit `year` has coverage `2005(1)2018`.
  - Example: a table with unit `month` has coverage `2005-08(1)2018-12`.
  - Example: a table with unit `week` has coverage `2005-08-01(7)2018-08-31`.
  - Example: a table with unit `day` has coverage `2005-08-01(1)2018-12-31`.
- Rules for filling in:
  - Table metadata: fill in the general format.
  - Column metadata: fill in the general format, except when `initial_date` or `final_date` is equal to the table's; in that case, leave it empty.
    - Example: suppose the table's coverage is `2005(1)2018`.
      - If a column appears only in 2012 and exists until 2018, we fill in its coverage as `2012(1)`.
      - If a column disappears in 2013, we fill in its coverage as `(1)2013`.
      - If a column exists over the same temporal coverage as the table, we fill in its coverage as `(1)`.
  - Key metadata: fill in the same pattern as columns, but with the reference being the corresponding column, not the table.
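A minimal validation sketch for this pattern (the regex and the checks below are our own illustration, not part of the metadata spec):

```python
import re

# <initial_date>(<temporal_unit>)<final_date>; either date may be omitted
# in column/key metadata when it matches the table's coverage.
COVERAGE_RE = re.compile(
    r"^(?P<initial>\d{4}(-\d{2}){0,2})?"  # 2005, 2005-08, or 2005-08-01
    r"\((?P<unit>\d+)\)"                  # e.g. (1) or (7)
    r"(?P<final>\d{4}(-\d{2}){0,2})?$"
)

for value in ["2005(1)2018", "2005-08(1)2018-12", "2012(1)", "(1)2013", "(1)"]:
    assert COVERAGE_RE.match(value), value
```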
## Cleaning STRINGs

- Categorical variables: initial uppercase and the rest lowercase, with accents.
- Unstructured STRINGs: keep them as they are.
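A minimal sketch of the categorical rule with pandas (the example values are hypothetical):

```python
import pandas as pd

raw = pd.Series(["  ENSINO FUNDAMENTAL ", "ensino médio"])
# Initial uppercase, rest lowercase, accents kept, extra spaces stripped
clean = raw.str.strip().str.capitalize()
# 0    Ensino fundamental
# 1    Ensino médio
```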
## Value formats

- Decimal: American format, i.e., always `.` (dot) instead of `,` (comma).
- Date: `YYYY-MM-DD`.
- Time (24h): `HH:MM:SS`.
- Datetime (ISO-8601): `YYYY-MM-DDTHH:MM:SS.sssZ`.
- Null value: `""` (CSV), `None` (Python), `NA` (R), `.` or `""` (Stata).
- Proportion/percentage: between 0 and 100.
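One way to produce these formats when writing a CSV with pandas (a sketch, not a required tool):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", None]),
    "proportion": [42.5, None],
})

# Dates as YYYY-MM-DD, decimals with '.', nulls as ""
df.to_csv("table.csv", index=False, date_format="%Y-%m-%d", na_rep="")
```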
## Table partitioning

### What is partitioning and what is its goal?

In a nutshell, partitioning a table means dividing it into multiple blocks/parts. The central objective is to reduce financial costs and increase performance: the larger the volume of data a query scans, the higher the storage and processing costs.

The cost reduction and performance gain happen mainly because partitioning reorganizes the dataset into small grouped blocks. In practice, partitioning makes it possible to avoid a query traversing the entire table just to return a small slice of the data.

An example with our beloved RAIS:

- Without partition filtering: BigQuery scans all (`*`) columns and rows of the dataset. Note that this cost is still not very large, because the table has already been partitioned. If this dataset had not gone through the partitioning process, this query would have cost much more money and time, as it involves a considerable volume of data.
- With partition filtering: here we filter by the partitioned columns `year` and `state_code`. As a result, BigQuery only queries and returns the values from the `year` folder and the `state_code` subfolder.
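For instance, a query of this shape only scans the filtered partitions; with the `basedosdados` Python package it could be run as below (the table and column names are illustrative):

```python
import basedosdados as bd

# Illustrative query: filtering by the partition columns limits the scan
query = """
    SELECT *
    FROM `basedosdados.br_me_rais.microdata`
    WHERE year = 2019
      AND state_code = 'SP'
"""
df = bd.read_sql(query, billing_project_id="<your_project_id>")
```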
### When should a table be partitioned?

The first question that arises when dealing with partitioning is: above how many rows should a table be partitioned? The GCP documentation does not define a specific row count above which tables should be partitioned. The ideal is for tables to be partitioned, with few exceptions. For example, a table with fewer than 10,000 rows that will no longer receive data ingestion has low storage and processing costs and therefore does not need to be partitioned.
### How to partition a table?

If the data is stored locally, it is necessary to:

- Create the partition folders in your `/output` folder, using the language you are working with.

Example of a table partitioned by `year` and `month`, using Python (assuming `df` is a pandas DataFrame with `year` and `month` columns, and `output` is the path to your `/output` folder, ending in `/`):
```python
import os

# Create one folder per (year, month) partition
for year in range(2005, 2020):
    for month in range(1, 13):
        partition = output + f'table_id/year={year}/month={month}'
        if not os.path.exists(partition):
            os.makedirs(partition)

# Write each data slice into its partition folder
for year in range(2005, 2020):
    for month in range(1, 13):
        df_partition = df[df['year'] == year].copy()  # the .copy() is not necessary, just good practice
        df_partition = df_partition[df_partition['month'] == month]
        df_partition.drop(['year', 'month'], axis=1, inplace=True)  # the partition columns must be dropped from the data itself
        partition = output + f'table_id/year={year}/month={month}/table.csv'
        df_partition.to_csv(partition, index=False, encoding='utf-8', na_rep='')
```
In R, the same hive-style folder structure (`key=value`) can be produced, for example, with `arrow::write_dataset()`, passing the partition columns to its `partitioning` argument.
Example of how to partition a table in SQL (note that BigQuery accepts only a single partitioning column, so this sketch partitions by an integer range on `year` and clusters by `month`):

```sql
CREATE TABLE `dataset_id.table_id` (
  year INT64,
  month INT64,
  col1 STRING,
  col2 STRING
)
PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(2005, 2020, 1))
CLUSTER BY month
OPTIONS (description = 'Description of the table');
```
### Important rules for partitioning

- The column types that BigQuery accepts for partitioning are:
  - Time-unit column: tables are partitioned based on a `TIMESTAMP`, `DATE`, or `DATETIME` column.
  - Processing time: tables are partitioned based on the timestamp at which BigQuery ingests the data.
  - Integer range: tables are partitioned based on an integer column.
- The column types that BigQuery does not accept for partitioning are: `BOOL`, `FLOAT64`, `BYTES`, etc.
- BigQuery accepts up to 4,000 partitions per table.
- In our datalake, tables are usually partitioned by `year`, `month`, `quarter`, and `state_code`.
- Note that when partitioning a table, the corresponding column must be excluded from the data itself. Example: the `year` column must be excluded when partitioning by `year`.
## Number of bases per pull request

Pull requests on GitHub should include at most one dataset, but can include more than one base. In other words, they can involve one or more tables within the same dataset.
## Dictionaries

- Each base includes only one dictionary (which covers one or more tables).
- For each table, column, and temporal coverage, each key maps uniquely to a value.
- Keys cannot have null values.
- Dictionaries must cover all keys available in the original tables.
- Keys can only have leading zeros when the variable's number of digits has meaning. When the variable is a standard `enum`, we remove the leading zeros (see the sketch after this list).
  - Example: we keep the leading zero in the variable `br_bd_diretorios_brasil.cbo_2002:cbo_2002`, which has six digits, because the first digit `0` means the category is from the `grand group = "Members of the armed forces, police, and firefighters"`.
  - For other cases, such as `br_inep_censo_escolar.stage:stage_education`, we remove the leading zeros, i.e., we change `01` to `1`.
- Values are standardized: no extra spaces, initial uppercase and the rest lowercase, etc.
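A minimal sketch of the leading-zeros rule for standard `enum` keys (the values are hypothetical):

```python
import pandas as pd

keys = pd.Series(["01", "02", "10"])
# Standard enum: drop leading zeros, e.g. "01" -> "1"
keys = keys.astype(int).astype(str)
```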
### How to fill in the table dictionary metadata?

- Do not fill in the `spatial_coverage`, i.e., leave the field empty.
- Do not fill in the `temporal_coverage`, i.e., leave the field empty.
- Do not fill in the `observation_level`, i.e., leave the field empty.
## Directories

Directories are the fundamental building blocks of our datalake. Our rules for managing directories are:

- Directories represent entities of the repository that have primary keys (e.g., `state`, `municipality`, `school`) and time-based units (e.g., `date`, `time`, `day`, `month`, `year`).
- Each directory table has at least one primary key with unique values and no nulls (see the check sketched below). Examples: `municipality:municipality_id`, `state:state_code`.
- Variable names with the `id_` prefix are reserved for primary keys of entities.

See all the tables already available here.
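A quick check of the primary-key rule (the file name and column below are hypothetical):

```python
import pandas as pd

directory = pd.read_csv("municipality.csv")

# Directory primary keys must be unique and non-null
assert directory["municipality_id"].notna().all()
assert directory["municipality_id"].is_unique
```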
### How to fill in the directory table metadata?

- Fill in the `spatial_coverage`, which is the maximum spatial unit the table covers. Example: `sa.br`, meaning the table's spatial aggregation level is Brazil.
- Do not fill in the `temporal_coverage`, i.e., leave the field empty.
- Fill in the `observation_level`, which is the observation level of the table, i.e., what each row represents.
- Do not fill in the `temporal_coverage` of the table's columns, i.e., leave the field empty.
## Raw Data Sources

This field refers to the data in the raw source, which has not yet been through the Data Basis methodology, i.e., our _input_. Clicking on it should redirect the user to the original data source page. Our rules for managing raw data sources are:

- Include the name of the external link that leads to the raw data source. By default, this name should be the name of the organization or of the portal that stores the data. Examples: `Educational Statistics: Open Data from Inep`, `Penn World Tables: Groningen Growth and Development Centre`.
- Fill in the raw data source metadata: Description, URL, Language, Has Structured Data, Has an API, Is Free, Requires Registration, Availability, Requires IP from Some Country, License Type, Temporal Coverage, Spatial Coverage, and Observation Level.
## Thought of improvements to the standards defined here?

Open an issue on our GitHub or send a message on Discord to talk to us :)