1.1 Kinesis data streams
Amazon Kinesis Data Streams is used to collect and process large streams of data records in real time. We can create data-processing applications on top of it, known as Kinesis Data Streams applications.
A Kinesis Data Streams application reads data from a data stream as data records and pushes it to the target. These applications can use the Kinesis Client Library (KCL) to read from the stream, and they can run on Amazon EC2 instances.
2. EC2 Instance Creation
Step1: In the AWS Management Console, go to EC2 and click Launch instance.
Choose the Amazon Linux AMI 2018.03.0, because it comes with several AWS tools pre-installed.
Step2: Now choose the instance type.
Step3: Launch the EC2 instance and download the .pem file (it contains the private key of the key pair), or use an existing key pair.
2.1 PuTTY configuration
Step1: Convert the downloaded .pem file to a .ppk file using PuTTYgen.
Step2: Launch PuTTY and select this .ppk file in the SSH > Auth settings.
Step3: Provide the EC2 hostname and save the session for future use.
Step4: Click Open, and an SSH session to the EC2 Linux instance will start.
Attach an IAM role to the EC2 instance to give it the permissions it needs.
Note: Attaching the AdministratorAccess policy is not good practice; grant only the permissions the application requires.
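As an illustration of a narrower alternative to administrator access, the sketch below builds an IAM policy document that only allows writing to a single stream. The function name and the ARN are hypothetical placeholders, and the action list is an assumption about what a producer needs; the role used for this article may need more (for example, S3 permissions for the archive copy).

```python
import json


def kinesis_producer_policy(stream_arn: str) -> dict:
    """Return an IAM policy document limited to writing to one Kinesis stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed minimum for a producer: look up the stream and put records.
                "Action": [
                    "kinesis:DescribeStreamSummary",
                    "kinesis:PutRecord",
                    "kinesis:PutRecords",
                ],
                "Resource": stream_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN: replace the region, account ID, and stream name.
    arn = "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream"
    print(json.dumps(kinesis_producer_policy(arn), indent=2))
```

The printed JSON can be pasted into a customer-managed policy and attached to the role that the EC2 instance uses.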
3. Kinesis Producer Library
3.1 Data to streams using KPL-python
An Amazon Kinesis Data Streams producer is an application that captures user data records from a source and puts them into a Kinesis data stream (also called data ingestion). The Kinesis Producer Library (KPL) simplifies producer application development and allows developers to achieve high write throughput to a Kinesis data stream.
Data on the EC2 instance is captured by the KPL code and sent to Kinesis Data Streams. Files arriving in the “source” folder are picked up, then copied to S3 and moved to an archive folder. The Python code runs on the EC2 instance in an infinite loop, so new data is streamed continuously through the producer library.
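The source-folder flow described above can be sketched roughly as follows. This is not the article's code: the function names, the polling interval, and the `handle_lines` callback are assumptions made for illustration. The callback stands in for whatever sends the lines to the stream (and, if needed, copies the file to S3).

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def sweep_source_folder(
    source_dir: Path,
    archive_dir: Path,
    handle_lines: Callable[[List[str]], None],
) -> int:
    """Pass each file's non-empty lines to handle_lines, then archive the file.

    Returns the number of files processed in this pass.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for path in sorted(source_dir.iterdir()):
        if not path.is_file():
            continue
        with path.open("r", encoding="utf-8") as fh:
            lines = [line.rstrip("\n") for line in fh if line.strip()]
        handle_lines(lines)
        # Archive only after the handler returned without raising.
        shutil.move(str(path), str(archive_dir / path.name))
        processed += 1
    return processed


def run_forever(source_dir, archive_dir, handle_lines, poll_seconds=5.0):
    """Keep sweeping the folder; sleep briefly when there is nothing new."""
    while True:
        if sweep_source_folder(source_dir, archive_dir, handle_lines) == 0:
            time.sleep(poll_seconds)
```

Moving a file only after the handler succeeds means a failed send leaves the file in the source folder, so the next pass retries it.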
Step2: Python code
The Python code uses the Kinesis Producer Library to capture data continuously and write it to the Kinesis stream named in the code.
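For readers who want to see the shape of such a producer, here is a minimal sketch. Note the substitution: it calls the `put_records` operation of a boto3 Kinesis client directly instead of going through the producer library used in this article. The function names and the choice of a SHA-256 digest as partition key are illustrative assumptions, and request size limits are not checked.

```python
import hashlib
from typing import Any, Dict, Iterable, Iterator, List

# PutRecords accepts at most 500 records per request.
MAX_RECORDS_PER_CALL = 500


def to_entries(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Turn text lines into PutRecords entries with a per-record partition key."""
    entries = []
    for line in lines:
        data = line.encode("utf-8")
        # Assumption: hashing the payload spreads records across shards.
        entries.append({"Data": data, "PartitionKey": hashlib.sha256(data).hexdigest()})
    return entries


def chunked(items: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_lines(client: Any, stream_name: str, lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Send lines in batches; return the entries Kinesis reported as failed."""
    failed = []
    for batch in chunked(to_entries(lines), MAX_RECORDS_PER_CALL):
        response = client.put_records(StreamName=stream_name, Records=batch)
        # Results come back in request order; failed ones carry an ErrorCode.
        for entry, result in zip(batch, response["Records"]):
            if "ErrorCode" in result:
                failed.append(entry)
    return failed
```

With boto3 installed and credentials supplied by the instance role, `send_lines(boto3.client("kinesis"), "example-stream", lines)` would be one way to call it, where `example-stream` is a placeholder name. The returned entries can be passed to `put_records` again after a short delay.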
Note: The Kinesis stream must be created in AWS beforehand, and its name must be specified in the code.
Along with the producer library, the jar files it requires must be downloaded and placed alongside the Python code.
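The stream can be created in the console, but a scripted check is also possible. The sketch below assumes a boto3 Kinesis client is passed in; `ensure_stream` is a hypothetical helper name, and the default of one shard is an arbitrary choice for illustration.

```python
from typing import Any


def ensure_stream(client: Any, stream_name: str, shard_count: int = 1) -> bool:
    """Create the stream if it is missing and wait until it is usable.

    Returns True if the stream was created by this call, False if it existed.
    """
    try:
        client.describe_stream_summary(StreamName=stream_name)
        created = False
    except client.exceptions.ResourceNotFoundException:
        client.create_stream(StreamName=stream_name, ShardCount=shard_count)
        created = True
    # The "stream_exists" waiter polls until the stream can be used.
    client.get_waiter("stream_exists").wait(StreamName=stream_name)
    return created
```

Calling this once at start-up, before the producer loop begins, avoids write errors caused by a missing or still-creating stream.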
Now run the Python code, and data will be produced continuously to the Kinesis stream. This data can be read from the stream by consumer libraries within the retention period (24 hours by default), and it can also be delivered to S3 by using a Kinesis Data Firehose delivery stream as a consumer.
Data captured into S3:
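To check that records are arriving, a single shard can be read back with the low-level API. The sketch below is an assumption-laden illustration rather than a replacement for a consumer library: `read_shard` is a hypothetical helper, it expects a boto3 Kinesis client, it assumes UTF-8 text payloads, and it stops after a fixed number of calls.

```python
from typing import Any, List


def read_shard(
    client: Any,
    stream_name: str,
    shard_id: str,
    max_calls: int = 10,
    limit: int = 100,
) -> List[str]:
    """Read text payloads from one shard, starting at the oldest retained record."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    payloads: List[str] = []
    for _ in range(max_calls):
        if iterator is None:
            break  # the shard was closed and fully read
        response = client.get_records(ShardIterator=iterator, Limit=limit)
        payloads.extend(record["Data"].decode("utf-8") for record in response["Records"])
        iterator = response.get("NextShardIterator")
    return payloads
```

`TRIM_HORIZON` starts at the oldest record still inside the retention period, which is why records older than that window no longer show up.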
- Purpose of the article – How to capture data using the Kinesis Producer Library.
- Intended Audience – People working on AWS
- References / Sources of the information referred – https://github.com/ludia/kinesis_producer
Contact for further details:
Rajya Lakshmi KUNA
Associate Trainee– Data lakes & DWH -Analytics