Let's Start Collecting Data into TreasureData

1. First Step with TreasureData

In short, when you start using TreasureData, there are three things you need to take care of:

  • Collect Data
  • Query Data
  • Visualize Data

This page describes the first step: collecting data into TreasureData.

[Diagram: collect, query, and visualize data on TreasureData]


1.1. Ways of Collecting Data

First, you have to plan how to collect data from your own data sources. Since there are several types of data source, the ways of collecting data into TreasureData can be categorized as follows.

You want to import data from ...

  1. Web Application on Backend (Java, Rails, Node, PHP, ...)
  2. Web Application on Frontend (Javascript, ...)
  3. Message Queue System (Kafka, AWS Kinesis, ...)
  4. Cloud Storage (AWS S3, Google Cloud Storage, ...)
  5. Database (MySQL, PostgreSQL, ...)
  6. Mobile Application (iOS, Android, ...)
  7. External Cloud Services

By the way, you DON'T need to define a table schema before importing data. This is enabled by schema-on-read, which is one of the most important features of TreasureData.

So, check which pattern suits your system, and take the next step through the related documents.


2. Let's collect data from ...

2.1. Web Application on Backend

When you import data from a backend web application, the architecture follows one of the following two patterns.

2.1.1. Forwarder-Only Case

[Diagram: forwarder-only architecture]

This architecture is simple. You embed a Fluent Logger into your web app and install fluentd on each web app server. Then fluentd uploads your data to TreasureData every few seconds to a minute.

Fluent Logger is implemented for each major language, so you can choose whichever one matches your stack.
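
For example, a minimal sketch with fluent-logger-python might look like the following (the database "mydb", the table "access_log", and the record fields are hypothetical; a local fluentd/td-agent is assumed to be listening on port 24224 and routing the "td." tag prefix to TreasureData):

    # A minimal sketch, assuming a local td-agent whose config routes the
    # "td." tag prefix to TreasureData (mydb/access_log are hypothetical).
    from fluent import sender, event

    sender.setup('td.mydb', host='127.0.0.1', port=24224)

    # One call emits one record; fluentd buffers it and uploads it to the
    # mydb.access_log table within a few seconds to a minute.
    event.Event('access_log', {
        'user_id': 42,
        'path': '/index.html',
        'status': 200,
    })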

Next Step

  • See Fluent Logger
    • td-agent (td-agent is TreasureData's supported package, which includes fluentd)
    • Rails (Rails apps use fluent-logger-ruby)
    • Node.js
    • Python
    • Java
      • Fluency (an alternative implementation of fluent-logger-java)

2.1.2. Forwarder - Aggregator Case

[Diagram: forwarder and aggregator architecture]

This architecture has a two-stage structure of forwarders and aggregators. Forwarders focus on transferring data to the aggregators, so that fluentd does not consume too many resources on the web app servers. Aggregators apply filters and additional plugins, then send the data to TreasureData. When you want to change a filter, you only need to change the aggregators' config.
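
For instance, the aggregator's td-agent config might contain a match block like the sketch below (the API key placeholder and buffer path are hypothetical); the forwarders then only need a plain forward output pointing at the aggregators:

    # A sketch of an aggregator match block using fluent-plugin-td,
    # which is bundled with td-agent (YOUR_TD_API_KEY is a placeholder).
    <match td.*.*>
      type tdlog
      apikey YOUR_TD_API_KEY
      auto_create_table
      use_ssl true
      buffer_type file
      buffer_path /var/log/td-agent/buffer/td
    </match>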

Next Step


2.2. Web Application on Frontend

If you want to collect data from the frontend, the way Google Analytics does, you can embed the Javascript SDK into your web pages. You can also deploy the Javascript SDK via Google Tag Manager.

Next Step

2.3. Message Queue System

As you know, a message queue system is useful for scaling the rate at which messages are added to the queue and processed. Each system has its own way of outputting data to TreasureData.

2.3.1. Apache Kafka

If you want to use Apache Kafka, your architecture would look like the following.

To push data from the web app servers into Kafka via fluentd, use fluent-plugin-kafka. After the data is buffered in Kafka, you can pull it into TreasureData with kafka-fluentd-consumer.
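
For example, the fluentd output that pushes into Kafka might be configured like this sketch (the broker host and topic name are hypothetical):

    # A sketch of a fluent-plugin-kafka buffered output on the web app servers
    <match td.*.*>
      type kafka_buffered
      brokers kafka1.example.com:9092
      default_topic td-records
      output_data_type json
    </match>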

Next Step


2.3.2. Amazon Kinesis

To collect data from Amazon Kinesis, use AWS Lambda to call the TreasureData API directly. The official document below shows how to do this in Python. If you want to do the same thing in Java, you can use td-client-java.
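
As a sketch, a Python Lambda handler that decodes Kinesis records and imports them with td-client-python might look like this (the database "mydb", the table "kinesis_log", and the TD_API_KEY environment variable are hypothetical; the msgpack and tdclient packages must be bundled with the function):

    # A minimal sketch of a Kinesis -> TreasureData Lambda handler.
    import base64
    import gzip
    import io
    import json
    import os
    import time

    import msgpack
    import tdclient

    def lambda_handler(event, context):
        # Pack each decoded Kinesis record as msgpack and gzip the batch,
        # the format that td-client's import_data expects.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            for record in event['Records']:
                payload = json.loads(base64.b64decode(record['kinesis']['data']))
                payload.setdefault('time', int(time.time()))  # TD's time column
                gz.write(msgpack.packb(payload))
        data = buf.getvalue()

        # Upload the batch to the mydb.kinesis_log table.
        with tdclient.Client(apikey=os.environ['TD_API_KEY']) as client:
            client.import_data('mydb', 'kinesis_log', 'msgpack.gz',
                               io.BytesIO(data), len(data))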

Next Step


2.4. Cloud Storage

I know you love storing data on AWS S3 and Google Cloud Storage. TreasureData can collect data from them with two types of tool: a pull type and a push type.

2.4.1. Pull Type: DataConnector

DataConnector is TreasureData's hosted data loader. With DataConnector, you don't need any additional servers to upload data. You just put files (e.g. csv, tsv, csv.gz, ...) into cloud storage and prepare a credential for DataConnector. DataConnector then ingests your data from cloud storage.
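
As an illustration, a minimal flow for loading CSV files from S3 might look like the following (the bucket, path, database, and table names are hypothetical):

    # seed.yml: a sketch of a DataConnector config for S3
    in:
      type: s3
      access_key_id: YOUR_AWS_ACCESS_KEY
      secret_access_key: YOUR_AWS_SECRET_KEY
      bucket: my-bucket
      path_prefix: logs/access_log
    out:
      mode: append

    # Let TreasureData guess the file format, then run the load on their side
    $ td connector:guess seed.yml -o load.yml
    $ td connector:issue load.yml --database mydb --table access_log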

Next Step


2.4.2. Push Type: Embulk (or BulkImport)

[Diagram: pushing data with Embulk from your own batch server]

Embulk is an open-source bulk data loader that helps transfer data between various databases, storages, file formats, and cloud services. If you can't allow TreasureData to access your storage directly, Embulk is a good choice, but you need to set it up on your own server as a batch server.
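
For example, a minimal Embulk flow pushing CSV files from S3 into TreasureData might look like this sketch (all credentials and names are hypothetical; the embulk-input-s3 and embulk-output-td plugins must be installed):

    # seed.yml: a sketch of an Embulk S3 -> TreasureData pipeline
    in:
      type: s3
      bucket: my-bucket
      path_prefix: logs/access_log
      access_key_id: YOUR_AWS_ACCESS_KEY
      secret_access_key: YOUR_AWS_SECRET_KEY
    out:
      type: td
      apikey: YOUR_TD_API_KEY
      endpoint: api.treasuredata.com
      database: mydb
      table: access_log
      auto_create_table: true

    $ embulk gem install embulk-input-s3 embulk-output-td
    $ embulk guess seed.yml -o config.yml
    $ embulk run config.yml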

NOTE: TreasureData provides two similar tools, Embulk and BulkImport. BulkImport is embedded in the td command, but it may be deprecated and will likely not be updated, because Embulk fixes several issues found in BulkImport. So I suggest using Embulk instead of BulkImport.

Also, DataConnector's core is built on Embulk.

Next Step


2.5. Database

Your service probably has master data in databases such as MySQL, PostgreSQL, or Redshift. In this case, you can use the same tools as for cloud storage, but first you need to check how the databases can be accessed.

2.5.1. Public Access: DataConnector

[Diagram: DataConnector pulling from a publicly accessible database]

DataConnector can collect data from several kinds of database, so if you can provide a credential, you can use DataConnector here as well.

If you have security concerns, TreasureData offers two options for DataConnector: SSL and static IPs.

If you need more security than that, it would be better to use Embulk.
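
For instance, a DataConnector seed config for MySQL might look like the sketch below (the host, credentials, and table are hypothetical); the guess/issue flow is the same as for cloud storage:

    # seed.yml: a sketch of a DataConnector config for MySQL
    in:
      type: mysql
      host: db.example.com
      port: 3306
      user: td_reader
      password: YOUR_DB_PASSWORD
      database: production
      table: users
    out:
      mode: replace

    $ td connector:guess seed.yml -o load.yml
    $ td connector:issue load.yml --database mydb --table users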

Next Step


2.5.2. Private Access: Embulk (or BulkImport)

[Diagram: Embulk pulling from a private database on your own batch server]

If you can't allow TreasureData to access your DB directly, Embulk is again a good choice. Embulk also has JDBC input plugins, so if you run a commercial RDB such as Oracle or SQL Server, Embulk covers those too.
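
A minimal Embulk config for pulling a MySQL table into TreasureData might look like this sketch (assuming the embulk-input-mysql and embulk-output-td plugins are installed; all hosts and names are hypothetical):

    # config.yml: a sketch of an Embulk MySQL -> TreasureData pipeline
    in:
      type: mysql
      host: db.internal.example.com
      user: td_reader
      password: YOUR_DB_PASSWORD
      database: production
      table: users
      select: "id, name, created_at"
    out:
      type: td
      apikey: YOUR_TD_API_KEY
      endpoint: api.treasuredata.com
      database: mydb
      table: users
      auto_create_table: true

    $ embulk run config.yml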

Next Step


2.6. Mobile Application

[Diagram: mobile SDKs sending data directly to TreasureData]

TreasureData provides mobile SDKs. By integrating them, you can import data from your application into TreasureData without building your own servers.

Next Step

2.6.1. Postbacks

[Diagram: postbacks from a mobile analytics service to TreasureData]

You probably already embed several tools in your mobile apps, and adding the TreasureData SDK on top of them may be hard due to app size or extra work. But TreasureData supports postbacks from several mobile analytics services. If you set this up on the analytics service side, you can transfer data to TreasureData with very little extra code.

Why use TreasureData on top of another analytics service? Because most mobile analytics services don't let you access the raw data, whereas with this postback feature you can access the raw data on TreasureData.

Next Step


2.7. External Cloud Services

DataConnector can also collect data from external cloud services. By collecting this data, you can combine data from those services with your own application data.

If you want to integrate other services, check this page and click the logo of the service you want to collect from.


Conclusion

This page is just a first step toward finding the import tools that suit you. I'll show best practices for each on another page.

Look forward to it!!