Big Industries Academy
Exploring Messaging and Streaming Technologies Part 4: AWS Kinesis
The fourth technology Francine Anestis is covering in the series "Exploring Messaging and Streaming Technologies" is Amazon Kinesis. She explores the Key Features, Architecture, Use Cases, Strengths & Weaknesses, Cost and Maturity Level.
Amazon Kinesis is a suite of services on AWS designed for real-time data streaming and processing. It enables you to collect, process, and analyze real-time, streaming data to get timely insights and react quickly to new information. Kinesis is particularly suitable for applications that require high-frequency data ingestion and processing, such as log and event data collection, real-time analytics, and IoT data processing.
The family of Amazon Kinesis comprises:
- Amazon Kinesis Data Streams: Ingest and store large streams of data records in real-time.
- Amazon Kinesis Video Streams: Ingest, store, encrypt, and index video streams in real-time.
- Amazon Kinesis Data Firehose: Collect, transform (for example with Lambda functions), and deliver streaming data to destinations such as Amazon S3 and Amazon Redshift.
- Amazon Kinesis Data Analytics: SQL-based engine that analyzes streaming data in real-time using standard SQL queries.
In this blog post we will focus on Amazon Kinesis Data Streams.
Key Features
- Enhanced Fan-Out: This feature allows multiple consumers to read data from the same shard concurrently without affecting each other's throughput. It ensures each consumer can read all the data in real-time without being limited by the read throughput of other consumers.
- Scaling: Scaling in Kinesis Data Streams involves adding or removing shards. This can be done manually through the AWS Management Console or programmatically using the AWS SDKs or CLI. Scaling allows you to adjust the throughput capacity of your stream to handle varying data ingestion rates.
- AWS Ecosystem: Kinesis Data Streams integrates seamlessly with other AWS services like AWS Lambda, Amazon Kinesis Data Analytics, Amazon Kinesis Data Firehose, Amazon EMR, Amazon Redshift, and more. This allows you to build end-to-end data processing pipelines using a combination of these services.
- Low Latency: Ingests and processes data in near real-time. With a standard consumer, records are typically available for reading about 200 ms after they are written; with enhanced fan-out, this drops to roughly 70 ms.
- Message Retention: Retains data for 24 hours by default; the retention period is configurable up to 365 days (extended retention incurs additional charges).
- Message Ordering: Guarantees ordering within a shard: records that share a partition key are delivered in the order they were published.
- Data Encryption: Supports server-side encryption using AWS Key Management Service (KMS) to protect data at rest.
- Schema Registry: Supports schemas through the AWS Glue Schema Registry, allowing you to define and manage schemas for your event data.
- Monitoring & Metrics: Provides detailed metrics and monitoring through Amazon CloudWatch, allowing you to track shard-level and stream-level metrics.
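The ordering and scaling features above both hinge on how Kinesis routes records: the partition key is hashed with MD5, and the 128-bit result is mapped onto the shards' hash-key ranges. A minimal Python model of that routing, assuming an evenly split hash-key space (the key names are illustrative):

```python
import hashlib

# Kinesis hashes the partition key with MD5 and maps the resulting
# 128-bit integer onto the shards' hash-key ranges. This simplified
# model assumes the hash-key space is split evenly across shards.
HASH_KEY_SPACE = 2 ** 128

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Index of the shard that would receive a record with this key."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hashed * shard_count // HASH_KEY_SPACE

# Records sharing a partition key always land on the same shard,
# which is what makes per-key ordering possible.
assert shard_for_key("device-42", 4) == shard_for_key("device-42", 4)
```

Because ordering is only guaranteed within a shard, choosing a partition key (a device ID, a user ID) is also choosing your ordering boundary.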
Architecture
Amazon Kinesis architecture typically involves the following components:
- Producers: Sources that push data records into the Kinesis Stream. These can be applications or devices that generate data in real-time, such as IoT devices, web servers, log producers, or clickstream data.
- Shards: Shards are the fundamental unit of scalability in Kinesis Data Streams; each shard provides a fixed slice of the stream's provisioned throughput (1 MB/s for writes and 2 MB/s for reads). Within a shard, data records form an ordered sequence, each uniquely identified by a sequence number.
- Consumers: Applications or services that consume data from Kinesis Streams for further processing or analysis.
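The producer and consumer sides of this architecture can be sketched with boto3, the AWS SDK for Python. The stream name and record fields below are hypothetical; the Kinesis API calls (`put_records`, `list_shards`, `get_shard_iterator`, `get_records`) are the real ones:

```python
import json

STREAM_NAME = "clickstream-events"  # hypothetical stream name

def build_records(events):
    """Format events as PutRecords entries; the partition key picks the shard."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event["user_id"]),  # keeps each user's events ordered
        }
        for event in events
    ]

def put_events(events, stream_name=STREAM_NAME):
    """Producer side: batch-write up to 500 records per call."""
    import boto3  # AWS SDK for Python
    kinesis = boto3.client("kinesis")
    return kinesis.put_records(StreamName=stream_name, Records=build_records(events))

def read_from_start(stream_name=STREAM_NAME):
    """Consumer side: read one batch from the first shard, oldest record first."""
    import boto3
    kinesis = boto3.client("kinesis")
    shard_id = kinesis.list_shards(StreamName=stream_name)["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest retained record
    )["ShardIterator"]
    return kinesis.get_records(ShardIterator=iterator)["Records"]
```

In production, consumers typically use the Kinesis Client Library (or Lambda event source mappings) rather than iterating shards by hand, since those handle checkpointing and resharding for you.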
Use Cases
Real-Time Analytics
- Analyzing log data, monitoring applications, and generating real-time metrics.
IoT Data Processing
- Collecting and processing data from IoT devices for real-time insights and actions.
Real-Time ETL
- Extracting, transforming, and loading data in real-time for data warehousing and analytics.
Clickstream Analytics
- Processing and analyzing user activity data from websites and applications.
Video Processing
- Ingesting and analyzing video streams for security monitoring, customer experience analysis, or machine learning.
Strengths
- Security: Kinesis Data Streams integrates with AWS Identity and Access Management (IAM) to control access to resources. You can define IAM policies to manage who can perform actions such as creating streams, putting records, or reading records.
- Scalability: Easily scales to handle high-throughput data streams.
- Managed Service: Reduces the operational burden with fully managed services like Kinesis Data Firehose.
- Flexible Data Retention: Retention is configurable from 24 hours (the default) up to 365 days; by comparison, Apache Kafka's default retention is 7 days.
- Documentation: AWS maintains well-established, freely accessible documentation for all of its services.
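To make the IAM point concrete, here is a minimal identity-based policy (shown as a Python dict) that lets a consumer read from one stream but not write to it. The account ID, region, and stream name are placeholders:

```python
# Read-only access to a single hypothetical stream; the Resource ARN
# (region, account ID, stream name) is a placeholder.
READ_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:ListShards",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
            ],
            "Resource": "arn:aws:kinesis:eu-west-1:123456789012:stream/clickstream-events",
        }
    ],
}
```

Producers would get a separate policy granting `kinesis:PutRecord`/`kinesis:PutRecords` instead, keeping read and write permissions cleanly separated.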
Weaknesses
- Shard Consumer Limits: Each shard's read capacity (2 MB/s and five GetRecords calls per second) is shared across all standard consumers. If multiple applications read from the same stream, they compete for this capacity, which can cause performance bottlenecks unless enhanced fan-out is used.
- Cost: Can become expensive at high throughputs and with long data retention periods.
- Complexity: Managing and optimizing shards for Kinesis Data Streams can be complex.
- Shard Scaling Limits: Managing shard scaling can be challenging. While Kinesis Data Streams can handle large-scale data ingestion, there are practical limits to the number of shards and throughput, requiring careful planning and monitoring.
- Manual Scaling: In provisioned mode, Kinesis does not resize streams automatically; scaling up or down typically requires manual intervention or custom scripts that adjust the shard count as data volume changes. (The newer on-demand capacity mode removes shard management, with a different pricing model.)
- Knowledge Requirements: Developers and administrators need a good understanding of AWS services, Kinesis architecture, and real-time data streaming to effectively use Kinesis Data Streams.
- Processing Latency: While Kinesis Data Streams is designed for real-time data processing, there can be some latency in data processing depending on the architecture and the number of consumers.
Cost
Amazon Kinesis pricing is based on several factors, including:
- Kinesis Data Streams:
- Shards: Charged per shard hour.
- PUT Payload Units: Charged per million PUT payload units.
- Extended Data Retention: Additional charges for retaining data beyond 24 hours.
- Kinesis Video Streams:
- Data Ingestion and Storage: Charged per GB ingested and stored.
- Data Retrieval: Charged per GB retrieved.
- Additional Costs: Using enhanced fan-out for multiple consumers to read from the same shard without contention incurs additional costs, which can add up in high-traffic scenarios.
To estimate costs and capacity requirements for Amazon Kinesis Data Streams (or any other AWS service), you can use the AWS Pricing Calculator.
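To make the provisioned-mode pricing model concrete, here is a back-of-the-envelope estimator. The dollar figures are assumptions (roughly the historical us-east-1 rates); always confirm current, region-specific prices with the AWS Pricing Calculator:

```python
import math

# Assumed prices -- verify against the AWS Pricing Calculator.
SHARD_HOUR_USD = 0.015   # per shard-hour (assumed)
PUT_UNIT_USD = 0.014     # per million 25 KB PUT payload units (assumed)
HOURS_PER_MONTH = 730

def monthly_cost(shards: int, records_per_sec: float, record_kb: float) -> float:
    """Estimate monthly USD cost for a provisioned Kinesis Data Stream."""
    shard_cost = shards * HOURS_PER_MONTH * SHARD_HOUR_USD
    # Each record is billed in 25 KB increments ("PUT payload units").
    units = records_per_sec * 3600 * HOURS_PER_MONTH * math.ceil(record_kb / 25)
    put_cost = units / 1_000_000 * PUT_UNIT_USD
    return shard_cost + put_cost

# Example: 2 shards ingesting 1,000 one-kilobyte records per second
# comes to roughly $59/month under these assumed rates.
```

Note how the PUT payload charge dominates at high record rates, which is why small records and long retention periods are the usual cost drivers.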
Maturity Level
- Mature (2013): Amazon Kinesis is a mature and well-established service within the AWS ecosystem. It has been widely adopted across various industries for real-time data processing and analytics. Continuous improvements and feature additions by AWS ensure that Kinesis remains competitive and capable of handling evolving data streaming needs.
AWS MSK
Keep in mind that AWS also offers Amazon MSK (Managed Streaming for Apache Kafka), a fully managed service that simplifies the setup, scaling, and management of Apache Kafka clusters. It automates administrative tasks such as hardware provisioning, software patching, setup, configuration, and backups. In a nutshell, it is Apache Kafka with all its characteristics, but managed by Amazon.
Conclusion
Amazon Kinesis is a powerful and flexible platform for real-time data streaming and processing. Its ability to handle high-throughput data, seamless integration with other AWS services, and managed service offerings make it an attractive choice for organizations looking to derive real-time insights from their data. However, careful consideration of cost and potential complexity is necessary when designing and implementing solutions with Kinesis. Overall, it is a robust solution for modern data-driven applications that require real-time processing capabilities.
Francine Anestis
With my diploma thesis and my internship both focused on ETL, Analysis and Forecasting of Big Streaming Data, I am keen on learning more and immersing myself in Data Engineering and the data space in general. Building data pipelines and working with Kafka, databases, and algorithms captivated me during my studies as an Electrical and Computer Engineer, and as a result I decided to dedicate myself to Data Engineering. I am very excited to start my learning and career path at Big Industries. Regarding my skills, if I had to choose one programming language and one platform, I would say that Python and Kafka are my strongest assets, but I am looking forward to extending that list.