Talend is joining the fight against COVID-19 by collaborating with developers from the Singer open source community and Bytecode to create an ETL tool for COVID-19 datasets.
Talend standardizes the data, augments it with metadata, then routes the results to a data warehouse or data lake: Amazon Redshift, Amazon S3, Snowflake, Microsoft Azure Synapse Analytics, Delta Lake for Databricks, or Google BigQuery.
Data engineers and scientists can run the tool on their own infrastructure or use Stitch for free.
The COVID-19 integration covers several datasets:
- Johns Hopkins CSSE Data
- EU Data
- Italy Data
- NY Times US Data
- Neher Lab Scenarios Data
- COVID-19 Tracking Project
The data stored in these repositories lacks a common format. For instance, the EU Data comprises data from different countries, and the header names for the same type of data differ. Even slight differences like these require data professionals to take extra time and steps to cleanse and standardize the data.
Processing these datasets through Talend’s ETL gives users consistent data, so they can focus on their models or visualizations and make faster, more confident decisions.
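To illustrate the kind of header mismatch described above, here is a minimal Python sketch of mapping differing column names onto one canonical set and tagging each record with its source. It is not Talend's or the tap's actual code: the alias table, the function names, and the `_source` metadata field are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical alias table: source-specific header names -> one canonical name.
# The real tap's field names may differ.
HEADER_ALIASES = {
    "country": "country",
    "countriesandterritories": "country",
    "country_region": "country",
    "cases": "cases",
    "confirmed": "cases",
    "date": "date",
    "daterep": "date",
}


def normalize_header(name):
    """Lowercase a header, unify separators, and map it to a canonical name."""
    key = name.strip().lower().replace("/", "_").replace(" ", "_")
    # Unknown headers are kept in their cleaned form rather than dropped.
    return HEADER_ALIASES.get(key, key)


def standardize_rows(csv_text, source):
    """Yield one dict per CSV row with canonical keys plus a source tag."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        record = {normalize_header(k): v for k, v in row.items()}
        record["_source"] = source  # illustrative metadata field (assumption)
        yield record
```

With a table like this, two files that label the same column `dateRep` and `Date` would both produce records keyed by `date`, so downstream queries do not need per-source special cases.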
The tap utilizes the GitHub V3 API library to query and retrieve files stored in multiple GitHub repositories.
Users must supply a GitHub token, which raises the number of API calls the tap is allowed to make. Users can then select one or all of the supported datasets and the fields associated with them, choose one of the Stitch destinations, and set the frequency of the loads.
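As a rough sketch of what a token-authenticated call to the GitHub v3 API can look like, the Python snippet below builds (but does not send) a request for a directory listing through the repository contents endpoint. It is not the tap's implementation: the helper name is hypothetical, and the repository and path in the usage note are assumptions chosen for illustration rather than details from this announcement.

```python
import urllib.request

GITHUB_API = "https://api.github.com"


def build_contents_request(owner, repo, path, token):
    """Build an authenticated GitHub v3 'contents' request (hypothetical helper)."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/contents/{path}"
    headers = {
        # Authenticated requests get a higher rate limit than anonymous ones.
        "Authorization": f"token {token}",
        # Ask explicitly for the v3 JSON representation.
        "Accept": "application/vnd.github.v3+json",
    }
    return urllib.request.Request(url, headers=headers)
```

A caller could then pass the result to `urllib.request.urlopen`, for example with `build_contents_request("CSSEGISandData", "COVID-19", "csse_covid_19_data", token)`, and parse the JSON listing to find the files to download.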
Given that the data is typically updated more than once a day, Talend suggests a frequency of every 6 to 12 hours, but users can choose more frequent replication.
These datasets should be beneficial to anyone doing health research. Interested researchers can run the data import for free on the Stitch platform.
For more information about this news, visit www.talend.com.