Upload data to DB
Why should my organization upload data to DB?
- Ability to cross-reference your databases with data from different organizations in a simple and easy way. There are already hundreds of public datasets from the largest organizations in Brazil and worldwide in our datalake.
- Commitment to transparency, data quality, and the development of better research, analyses, and solutions for society. We not only democratize access to open data but also to quality data. We have a specialized team that reviews and ensures the quality of data added to the datalake.
- Participation in an ever-growing community: thousands of journalists, researchers, and developers already use and follow Data Basis.
Step by step to upload data
Want to upload data to DB and help us build this repository? Wonderful! We've organized everything you need in the manual below, in 8 steps.
To facilitate the explanation, we'll follow a ready-made example with data from RAIS.
You can navigate through the steps in the menu on the left.
We strongly suggest joining our Discord channel to ask questions and interact with the team and other contributors! 😉
Before starting
Some knowledge is necessary to carry out this process:
- Python, R, SQL and/or Stata: to create data capture and cleaning codes.
- Command line: to set up your local environment and connection with Google Cloud.
- Github: to upload your code for review by our team.
Don't have some of these skills but want to contribute?
We have a data team that can help you, just join our Discord and send a message in #want-to-contribute.
How does the process work?
- Choose the dataset and understand more about the data - first we need to know what we're dealing with.
- Download our template folder - it's time to structure the work to be done
- Fill in the architecture tables - it's essential to define the data structure before we start treatment
- Write data capture and cleaning code - time to get to work!
- (If necessary) Organize auxiliary files - because even data needs guides
- (If necessary) Create dictionary table - time to build the dictionaries
- Upload everything to Google Cloud - after all, that's where DB data is stored
- Send everything for review - a look from our team to ensure everything is ready for production!
1. Choose the dataset and understand more about the data
We keep the list of datasets for volunteers on our GitHub. To start uploading a dataset you're interested in, just open a new data issue. If your dataset is already listed, just mark your GitHub user as assignee.
Your first task is to fill in the information in the issue. This information will help you understand the data better and will be very useful for treatment and for filling in the metadata.
When you finish this step, call someone from the data team so that the information you mapped about the dataset can already be added to our site!
2. Download our template folder
Download the template folder here and rename it to `<dataset_id>` (defined in the issue from step 1). This template folder simplifies and organizes all the steps from here on. Its structure is as follows:
`<dataset_id>/`

- `code/`: Codes needed for capturing and cleaning the data (we'll see more in step 4).
- `input/`: Contains all the original files with data, exactly as downloaded from the primary source (we'll see more in step 4).
- `output/`: Final files, already in ready-to-upload format (we'll see more in step 4).
- `tmp/`: Any temporary files created by the code in `/code` during the cleaning and treatment process (we'll see more in step 4).
- `extra/`
  - `architecture/`: Architecture tables (we'll see more in step 3).
  - `auxiliary_files/`: Auxiliary files for the data (we'll see more in step 5).
  - `dicionario.csv`: Dictionary table for the entire dataset (we'll see more in step 6).

Only the `code` folder will be committed to your project; the other files will exist only locally or in Google Cloud.
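If you prefer to create this structure programmatically rather than downloading the template, here is a minimal sketch (it only mirrors the folder layout above; `<dataset_id>` is a placeholder you should replace):

```python
from pathlib import Path

# Replace with the dataset_id defined in the issue from step 1
dataset_id = "<dataset_id>"

# Folder layout of the template folder described above
folders = [
    "code",
    "input",
    "output",
    "tmp",
    "extra/architecture",
    "extra/auxiliary_files",
]

root = Path(dataset_id)
for folder in folders:
    (root / folder).mkdir(parents=True, exist_ok=True)

# Empty placeholder for the dataset-wide dictionary (filled in step 6)
(root / "extra" / "dicionario.csv").touch()
```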
3. Fill in the architecture tables
The architecture tables determine what the structure of each table in your dataset is. They define, for example, the name, order, and metadata of the variables, as well as compatibilities when there are changes in versions (for example, if a variable changes name from one year to the next).
Each dataset table must have its own architecture table (spreadsheet), which must be filled in Google Drive to allow correction by our data team.
Example: RAIS - Architecture tables
The RAIS architecture tables can be consulted here. They are a great reference for you to start your work since they have many variables and examples of various situations you might end up encountering.
To fill in each table of your dataset, follow these steps:
At the beginning and end of each step, consult our style guide to make sure you're following the BD standardization.
- List all the variables in the data in the `original_name` column
  - Note: if the base changes the names of its variables over time (like RAIS), you need to make them compatible across years by filling in the `original_name_YYYY` column for each year or month available
- Rename the variables according to our manual in the `name` column
- Understand the type of each variable and fill in the `bigquery_type` column
- Fill in the description in the `description` column according to the manual
- From the compatibilities across years and/or queries to the raw data, fill in the temporal coverage in `temporal_coverage` for each variable
  - Note: if the variables have the same temporal coverage as the table, fill in only with '(1)'
- Indicate with 'yes' or 'no' whether there's a dictionary for the variables in `covered_by_dictionary`
- Check whether the variables represent any entity present in the directories to fill in `directory_column`
- For variables of type `int64` or `float64`, check whether it's necessary to include a measurement unit
- Reorder the variables according to the manual
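As a purely illustrative sketch of what a filled-in architecture table might contain (the real table is a spreadsheet in Google Drive; the rows, values, and the `measurement_unit` column name below are hypothetical), here is how two rows could be drafted with pandas:

```python
import pandas as pd

# Hypothetical rows; column names follow the steps above, values are examples only
architecture = pd.DataFrame([
    {
        "original_name": "ANO",
        "name": "year",                  # renamed following the style guide
        "bigquery_type": "int64",
        "description": "Reference year",
        "temporal_coverage": "(1)",      # same temporal coverage as the table
        "covered_by_dictionary": "no",
        "directory_column": "",          # fill in if the variable maps to a directory entity
        "measurement_unit": "",          # column name assumed; check the style guide
    },
    {
        "original_name": "UF",
        "name": "state_abbreviation",
        "bigquery_type": "string",
        "description": "State abbreviation",
        "temporal_coverage": "(1)",
        "covered_by_dictionary": "no",
        "directory_column": "",
        "measurement_unit": "",
    },
])

# Keep a local copy under extra/architecture/ while the spreadsheet lives in Google Drive
architecture.to_csv("extra/architecture/<table_id>.csv", index=False)
```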
When you finish filling in the architecture tables, contact the Data Basis team to validate everything. The final format of the data must be clear before you start writing the code; this way we avoid rework.
4. Write data capture and cleaning code
After validating the architecture tables, we can write the capture and cleaning codes for the data.
- Capture: code that automatically downloads all the original data and saves it in `/input`. The data may be available on portals or FTP links, may need to be scraped from websites, among other cases.
- Cleaning: code that transforms the original data saved in `/input` into clean data, saved in the `/output` folder, to be later uploaded to DB (a rough sketch of both pieces appears below).
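As a rough sketch of how these two pieces can be organized in a single script (the URL, file names, and column rename below are hypothetical; adapt them to your source):

```python
from pathlib import Path

import pandas as pd
import requests

# Paths relative to the <dataset_id> root folder, so the code runs on any machine
INPUT = Path("input")
OUTPUT = Path("output")
INPUT.mkdir(exist_ok=True)
OUTPUT.mkdir(exist_ok=True)


def capture():
    """Download the original data, exactly as published, into /input."""
    url = "https://example.org/original_data.csv"  # hypothetical primary source
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    (INPUT / "original_data.csv").write_bytes(response.content)


def clean():
    """Transform the raw file into the format defined in the architecture table."""
    df = pd.read_csv(INPUT / "original_data.csv")
    df = df.rename(columns={"ANO": "year"})  # hypothetical rename from the architecture table
    df.to_csv(OUTPUT / "table.csv", index=False)


if __name__ == "__main__":
    capture()
    clean()
```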
Each clean table for production can be saved as a single file or, if it's very large (e.g. above 200 MB), it can be partitioned in Hive format into several sub-files. The accepted formats are `.csv` or `.parquet`. Our recommendation is to partition tables by `year`, `month`, and `state_abbreviation`. Partitioning is done through the folder structure; see the example below to visualize how.
Example: RAIS - Partitioning
The `microdados_vinculos` table from RAIS, for example, is a very large table (+250 GB), so we partition it by `year` and `state_abbreviation`. The partitioning was done using the folder structure `/microdados_vinculos/year=YYYY/state_abbreviation=XX`.
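A minimal sketch of how this kind of Hive-style partitioned output could be written with pandas (the DataFrame `df` is assumed to be your already-cleaned table, with `year` and `state_abbreviation` columns):

```python
from pathlib import Path

import pandas as pd


def write_partitions(df: pd.DataFrame, output_dir: str = "output/microdados_vinculos"):
    """Write one CSV per (year, state_abbreviation) pair using Hive-style folders."""
    for (year, uf), chunk in df.groupby(["year", "state_abbreviation"]):
        partition = Path(output_dir) / f"year={year}" / f"state_abbreviation={uf}"
        partition.mkdir(parents=True, exist_ok=True)
        # The partition values are encoded in the folder names, so drop the columns from the file
        chunk.drop(columns=["year", "state_abbreviation"]).to_csv(
            partition / "data.csv", index=False
        )
```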
Required patterns in the code
- Must be written in Python, R, or Stata, so that the review can be performed by our team.
- Can be a script (`.py`, `.R`, ...) or a notebook (Google Colab, Jupyter, Rmarkdown, etc.).
- File paths must be relative to the root folder (`<dataset_id>`), that is, they must not depend on the paths of your computer (see the sketch after this list).
- The cleaning must follow our style guide and programming best practices.
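For instance, a hedged illustration of the relative-path rule, assuming the script is run from the `<dataset_id>` root (the file name is hypothetical):

```python
from pathlib import Path

import pandas as pd

# Good: path relative to the <dataset_id> root, works on any machine
df = pd.read_csv(Path("input") / "original_data.csv")

# Bad: absolute path tied to one computer (avoid)
# df = pd.read_csv("/Users/<your_username>/projects/<dataset_id>/input/original_data.csv")
```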
Example: PNAD Continuous - Cleaning Code
The cleaning code was built in R and can be consulted here.
Example: Activity in the Legislative Chamber - Download and Cleaning Code
The cleaning code was built in Python and can be consulted here.
5. (If necessary) Organize auxiliary files
It's common for databases to be made available with auxiliary files. These can include technical notes, collection and sampling descriptions, etc. To help users of Data Basis have more context and understand the data better, organize all these auxiliary files in `/extra/auxiliary_files`.
Feel free to structure sub-folders as you like there. What matters is that it's clear what these files are.
6. (If necessary) Create dictionary table
Sometimes, especially with old bases, there are multiple dictionaries in Excel or other formats. In Data Basis, we unify everything in a single file in `.csv` format - a single dictionary for all columns of all tables in your dataset.
Important details on how to build your dictionary are in our style guide.
Example: RAIS - Dictionary
The complete dictionary can be consulted here. It already has the standard structure we use for dictionaries.
7. Upload everything to Google Cloud
All set! Now all that's left is to upload the data to Google Cloud and send it for review. For that, we'll use the `basedosdados` client (available in Python), which facilitates the process.
Since storage in GCS has a cost, to complete this step we'll need to provide you with an api_key specific to volunteers, so you can upload the data to our development environment. So join our Discord channel and call us in #want-to-contribute.
Configure your credentials locally
7.1 Install our client in your terminal: `pip install basedosdados`.
7.2 Run `import basedosdados as bd` in Python and follow the step-by-step process to configure your local environment with the credentials of your project in Google Cloud. Fill in the information as follows:
* STEP 1: y
* STEP 2: basedosdados-dev (place the .json file provided by the BD team in the credentials folder)
* STEP 3: y
* STEP 4: basedosdados-dev
* STEP 5: https://api.basedosdados.org/api/v1/graphql
Upload the files to the Cloud
The data will pass through 3 places in Google Cloud:
- Storage: also called GCS, it's where the "cold" files (architectures, data, auxiliary files) are stored.
- BigQuery-DEV-Staging: table that connects the data in Storage to the basedosdados-dev project in BigQuery.
- BigQuery-DEV-Production: table used for testing and treating the dataset via SQL.
7.3 Create the table in the GCS bucket and BigQuery-DEV-staging, using the Python API, as follows:
```python
import basedosdados as bd

# Point to the dataset and table you are uploading
tb = bd.Table(
    dataset_id='<dataset_id>',
    table_id='<table_id>')

# Upload the files in path to GCS and create the staging table in BigQuery-DEV
tb.create(
    path='<path_to_the_data>',
    if_table_exists='raise',
    if_storage_data_exists='raise',
)
```
The following parameters can be used:
- `path` (required): the complete path of the file on your computer, like: `/Users/<your_username>/projects/basedosdados/mais/bases/[DATASET_ID]/output/microdados.csv`.
!!! Tip "If your data is partitioned, the path must point to the folder where the partitions are. Otherwise, it must point to a `.csv` file (for example, microdados.csv)."
- `force_dataset`: command that creates the dataset configuration files in BigQuery.
- _True_: the dataset configuration files will be created in your project and, if it doesn't exist in BigQuery, it will be created automatically. **If you already created and configured the dataset, don't use this option, as it will overwrite files**.
- _False_: the dataset won't be recreated and, if it doesn't exist, it will be created automatically.
- `if_table_exists` : command used if the **table already exists in BQ**:
- _raise_: returns an error message.
- _replace_: replaces the table.
- _pass_: does nothing.
- `if_storage_data_exists`: command used if the **data already exists in Google Cloud Storage**:
- _raise_: returns an error message
- _replace_: replaces the existing data.
- _pass_: does nothing.
!!! Info "If the project doesn't exist in BigQuery, it will be automatically created"
Consult our API for more details on each method.
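For instance, a hedged variation for a partitioned table that has already been uploaded once (the `table_id` and path come from the RAIS example above; `replace` overwrites the existing data and table):

```python
import basedosdados as bd

tb = bd.Table(
    dataset_id='<dataset_id>',
    table_id='microdados_vinculos')  # hypothetical table_id from the RAIS example

tb.create(
    # For partitioned data, point to the folder holding the year=.../state_abbreviation=... subfolders
    path='output/microdados_vinculos',
    if_table_exists='replace',
    if_storage_data_exists='replace',
)
```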
7.4 Create the `.sql` and `schema.yml` files from the architecture table following this documentation.
If you need, at this point you can change the SQL query to perform final treatments on the `staging` table: you can include columns, remove columns, perform algebraic operations, substitute strings, etc. SQL is the limit!
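One way to sanity-check such a treatment query from Python, assuming the credentials from step 7.2 are configured and using the client's `read_sql` helper (the query and table reference below are hypothetical placeholders):

```python
import basedosdados as bd

# Hypothetical treatment query against the dev staging table
query = """
SELECT
    SAFE_CAST(year AS INT64) AS year,
    SAFE_CAST(state_abbreviation AS STRING) AS state_abbreviation
FROM `basedosdados-dev.<dataset_id>_staging.<table_id>`
"""

df = bd.read_sql(query, billing_project_id="basedosdados-dev")
print(df.head())
```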
7.5 Run and test the models locally following this documentation
7.6 Upload the table metadata to the site:
For now, only the data team has permissions to upload the table metadata to the site, so it will be necessary to contact us. We're already working to make it possible for volunteers to update data on the site in the near future.
7.7 Upload the auxiliary files:
```python
st = bd.Storage(
    dataset_id='<dataset_id>',
    table_id='<table_id>')

st.upload(
    path='<path_to_auxiliary_files>',
    mode='auxiliary_files',
    if_exists='raise')
```
8. Send everything for review
Yay, that's it! Now all that's left is to send everything for review in the repository of Data Basis.
- Clone our repository locally.
- `cd` into the local folder of the repository and open a new branch with `git checkout -b [dataset_id]`. All additions and modifications will be included in this branch.
- For each new table, include a file named `table_id.sql` in the `queries-basedosdados/models/dataset_id/` folder by copying the queries you developed in step 7.
- Include the `schema.yaml` file developed in step 7.
- If it's a new dataset, include the dataset according to the instructions in the `queries-basedosdados/dbt_project.yaml` file (don't forget to follow alphabetical order so you don't mess up the organization).
- Include your data capture and cleaning code in the `queries-basedosdados/models/dataset_id/code` folder.
- Now just publish the branch and open a PR with the 'table-approve' label, marking the data team for review.
And now? Our team will review the data and metadata submitted via GitHub. We may contact you to ask questions or request changes to the code. When everything is OK, we'll merge your pull request and the data will be automatically published on our platform!