Why Azure might overtake AWS in its data services offerings
Microsoft enterprise cloud services are independently validated through certifications and attestations, as well as third-party audits. In-scope services within the Microsoft Cloud meet key international and industry-specific compliance standards, such as ISO/IEC 27001 and ISO/IEC 27018, FedRAMP, and SOC 1 and SOC 2. They also meet regional and country-specific standards and contractual commitments, including the EU Model Clauses, UK G-Cloud, Singapore MTCS, and Australia CCSL (IRAP). In addition, rigorous third-party audits, such as by the British Standards Institution and Deloitte, validate the adherence of our cloud services to the strict requirements these standards mandate.
With security tightly integrated into its Active Directory offerings, Microsoft continues to evolve security and data integrity in Azure. The core security advantages of Azure are as follows:
- Tightly integrated with Windows Active Directory
- Simplified cloud access using Single Sign On
- Single- and multi-factor authentication support
- Rich protocol support, e.g., Federated Authentication (WS-Federation), SAML, OAuth 2.0 (version 1.0 still supported), OpenID Connect, the Graph API, and Web API 2.0 (in conjunction with Authentication_JSON_AppService.axd and the Authorize attribute)
- Data/BI Offerings in the Cloud
Microsoft Azure has essentially all the components required to support data and business intelligence workloads in various formats.
The core database and data collection/integration formats in Azure are as follows:
- Data Factory
Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. You can create data integration solutions using the Data Factory service that ingest data from various data stores, transform/process the data, and publish the resulting data back to data stores.
The Data Factory service allows you to create data pipelines that move and transform data and then run those pipelines on a specified schedule (hourly, daily, weekly, and so on). It also provides rich visualizations that display the lineage and dependencies between your pipelines, and lets you monitor all of them from a single unified view to easily pinpoint issues and set up monitoring alerts.
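The shape of such a pipeline definition can be sketched as a plain Python dict mirroring Data Factory's JSON layout. All names here (pipeline, activity, and dataset names) are illustrative, not taken from a real deployment:

```python
# A minimal sketch of a Data Factory copy pipeline definition, built as a
# Python dict. Pipeline, activity, and dataset names are hypothetical.
pipeline = {
    "name": "CopyBlobToSql",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlob",
                "type": "Copy",
                "inputs": [{"name": "BlobInputDataset"}],
                "outputs": [{"name": "SqlOutputDataset"}],
            }
        ],
        # The schedule window the pipeline runs over, as described above.
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-12-31T00:00:00Z",
    },
}

def activity_names(p):
    """Return the names of all activities in a pipeline definition."""
    return [a["name"] for a in p["properties"]["activities"]]

print(activity_names(pipeline))  # ['CopyFromBlob']
```

In a real deployment this definition would be submitted to the Data Factory service, which then runs the copy activity on the configured schedule.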
- SQL Server Integration Services (SSIS)
Leveraging SSIS, the core tool used for ongoing development across various teams, one can move data into and out of Azure, to on-premises or other cloud environments, as needed. SSIS can integrate databases and data warehouses in the Azure cloud and also enables template-based development efforts with ease.
Typically compared with AWS data pipeline:
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
A pipeline schedules and runs tasks. You upload your pipeline definition to the pipeline, and then activate the pipeline. You can edit the pipeline definition for a running pipeline and activate the pipeline again for it to take effect. You can deactivate the pipeline, modify a data source, and then activate the pipeline again. When you are finished with your pipeline, you can delete it. Task Runner polls for tasks and then performs those tasks. For example, Task Runner could copy log files to Amazon S3 and launch Amazon EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline.
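The Task Runner polling model described above can be sketched as a toy loop: the pipeline queues tasks, and a runner polls for them and performs them. The task names below are illustrative, not real Data Pipeline task identifiers:

```python
import queue

# Toy sketch of the Task Runner polling model: tasks are queued by the
# pipeline, and a runner polls for them until none remain.
tasks = queue.Queue()
tasks.put("copy-logs-to-s3")
tasks.put("launch-emr-cluster")

completed = []

def poll_and_run(task_queue, done):
    """Poll for tasks until the queue is empty, recording each as it runs."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        done.append(task)  # a real runner would execute the task here

poll_and_run(tasks, completed)
print(completed)  # ['copy-logs-to-s3', 'launch-emr-cluster']
```

A custom task runner follows the same contract: poll, claim a task, perform it, report status back.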
- Azure SQL Data Warehouse
Azure SQL Data Warehouse stores data as files and blobs, which allows dimensional data and measures to be combined easily within the platform. Core documentation can be found here: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-what-is
Typically Compared with AWS Redshift:
Redshift supports two kinds of sort keys: compound and interleaved. A compound sort key is a combination of multiple columns: one primary column and one or more secondary columns. A compound sort key helps with joins and WHERE conditions; however, performance drops when a query touches only the secondary columns without the primary column. A compound sort key is the default sort type. In an interleaved sort, each column is given equal weight. Both compound and interleaved sort keys require periodic re-indexing to keep query performance high. The architectures of these systems differ, but the end goal is the same: storage and processing of vast amounts of data with second- or millisecond-level result generation. Focusing on this aspect, I am going to give a more detailed insight into Redshift, a node-based, petabyte-scale database, along with a high-level overview of what I recently implemented.
Note: The above diagram is from the Redshift Warehousing article (http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html)
If you pay close attention to the diagram, the compute nodes are responsible for all the data processing and query transactions on the database, and the data nodes contain the actual node slices. The leader node is the service that monitors the data connections to the Redshift cluster and is also responsible for query processing (acting almost like a syntax checker on the query and the functions leveraged). It then transfers the query to the compute nodes, whose main responsibility is to find the data slices in the data nodes and communicate with one another to determine how the transaction needs to be executed. It is similar to a job broker, except that it operates in near-real time.
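The compound sort key behavior described earlier can be illustrated with a small sketch: rows sorted by a (primary, secondary) tuple keep all rows for a given primary value contiguous, which is why filters on the primary column are fast while filters on secondary columns alone cannot skip blocks. The column names here are hypothetical:

```python
# Toy illustration of a compound sort key: event_date is the primary sort
# column and user_id the secondary. Rows are ordered by the tuple, so a
# filter on event_date scans a contiguous range; a filter on user_id alone
# would still have to scan everything.
rows = [("2017-03-02", 7), ("2017-03-01", 9), ("2017-03-01", 2), ("2017-03-02", 1)]
compound_sorted = sorted(rows)  # sorts by (event_date, user_id)

# With the compound order, all rows for a given date are contiguous:
march1 = [r for r in compound_sorted if r[0] == "2017-03-01"]
print(compound_sorted)
print(march1)
```

An interleaved sort key, by contrast, weights both columns equally, so neither column's filters degrade as badly, at the cost of more expensive maintenance.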
It is similar to the analogy of using a bucket. Consider this: you take a bucket and keep filling it with water; eventually the bucket gets filled. But what happens when there is an enormous amount of water to contain? Either grab a massive bucket or use multiple buckets to store the water (the second option is what the Redshift architecture depicts).
Scaling out implies not only adding a bucket for storage but also a mechanism to ensure that the pipeline flow goes to an empty bucket, which is nothing but our compute node. But there is the price of the bucket, plus the cost of the mechanism required to populate it.
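The bucket analogy can be put into code: rather than one massive bucket, incoming "water" (records) is distributed across several buckets (nodes) by a routing mechanism. Round-robin routing below is a simplification of what the leader node actually does:

```python
# The bucket analogy as code: distribute incoming records across several
# buckets (compute nodes) instead of filling one giant bucket.
NUM_BUCKETS = 3
buckets = [[] for _ in range(NUM_BUCKETS)]

for i, record in enumerate(range(10)):  # ten units of "water"
    buckets[i % NUM_BUCKETS].append(record)  # the routing mechanism

print([len(b) for b in buckets])  # [4, 3, 3]
```

The routing mechanism itself has a cost, which is the point of the analogy: you pay for the buckets and for the plumbing between them.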
The cons of using Redshift are as follows:
- You need an Administrator to monitor the Redshift cluster at all times
- The power of Redshift lies in the manner the database is queried, so a SQL developer/DBA with understanding of the Redshift internals is definitely a must
- Upfront cost is slightly on the higher side, but over time the cost is justified as more nodes are added to the cluster
A major advantage of Redshift is that it supports both JDBC and ODBC drivers for querying the database. Because Redshift is based on PostgreSQL, a standard Postgres driver can be used to connect.
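Since Redshift speaks the Postgres wire protocol, a connection can be described with an ordinary Postgres-style DSN. The sketch below only builds the string; the cluster endpoint and credentials are placeholders, and 5439 is Redshift's default port:

```python
# Build a Postgres-style DSN for a Redshift cluster. The endpoint, database,
# and credentials below are placeholders, not a real cluster.
def redshift_dsn(host, dbname, user, password, port=5439):
    """Return a libpq-style connection string (5439 is Redshift's default port)."""
    return f"host={host} port={port} dbname={dbname} user={user} password={password}"

dsn = redshift_dsn("examplecluster.abc123.us-east-1.redshift.amazonaws.com",
                   "dev", "awsuser", "secret")
print(dsn)
```

A Postgres driver such as psycopg2 (or any JDBC/ODBC Postgres driver) could then use this DSN to open a connection and run queries.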
- Azure Redis Cache
Azure Redis Cache is based on the popular open-source Redis cache. It gives you access to a secure, dedicated Redis cache, managed by Microsoft and accessible from any application within Azure.
Azure Redis Cache is available in the following tiers:
- Basic: single node, multiple sizes; ideal for development/test and non-critical workloads. The Basic tier has no SLA.
- Standard: a replicated cache in a two-node primary/secondary configuration managed by Microsoft, with a high-availability SLA.
- Premium: includes a high-availability SLA and all Standard-tier features and more, such as better performance than Basic or Standard-tier caches, bigger workloads, disaster recovery, and enhanced security. Additional Premium features include:
- Redis persistence, which allows you to persist data stored in the cache; you can also take snapshots and back up the data, which you can load in case of a failure.
- Redis cluster, which automatically shards data across multiple Redis nodes, so you can create workloads with bigger memory sizes (greater than 53 GB) and get better performance.
- Azure Virtual Network (VNET) deployment, which provides enhanced security and isolation for your Azure Redis Cache, along with subnets, access control policies, and other features to further restrict access.
Basic and Standard caches are available in sizes up to 53 GB, and Premium caches are available in sizes up to 530 GB, with more on request.
Azure Redis Cache helps your application become more responsive even as user load increases. It leverages the low-latency, high-throughput capabilities of the Redis engine. This separate, distributed cache layer allows your data tier to scale independently for more efficient use of compute resources in your application layer.
Unlike traditional caches which deal only with key-value pairs, Redis is popular for its highly performant data types. Redis also supports running atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set. Other features include support for transactions, pub/sub, Lua scripting, keys with a limited time-to-live, and configuration settings to make Redis behave more like a traditional cache.
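The data-type semantics described above can be illustrated with a tiny in-memory mock. Note this is not the Redis client itself; it only mirrors the behavior of a few commands to make the idea concrete:

```python
# A tiny in-memory mock illustrating Redis data-type semantics. This is NOT
# the redis client library; it only mirrors a few command behaviors.
store = {}

def incr(key):
    """Like Redis INCR: atomically increment a string counter."""
    store[key] = store.get(key, 0) + 1
    return store[key]

def lpush(key, value):
    """Like Redis LPUSH: push a value onto the head of a list."""
    store.setdefault(key, []).insert(0, value)

def sinter(key_a, key_b):
    """Like Redis SINTER: intersection of two sets."""
    return store[key_a] & store[key_b]

incr("hits"); incr("hits")
lpush("recent", "page2"); lpush("recent", "page1")
store["a"] = {1, 2, 3}; store["b"] = {2, 3, 4}
print(store["hits"], store["recent"], sorted(sinter("a", "b")))
# 2 ['page1', 'page2'] [2, 3]
```

With the actual redis-py client, the equivalent calls would be `r.incr("hits")`, `r.lpush("recent", "page1")`, and `r.sinter("a", "b")` against a live cache endpoint.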
Another key aspect to Redis success is the healthy, vibrant open source ecosystem built around it. This is reflected in the diverse set of Redis clients available across multiple languages. This allows it to be used by nearly any workload you would build inside of Azure.
Limitations of Redis Cache include:
- Simple query language with no complex features
- Not reachable from other regions out of the box
- Always limited by the amount of memory
- Sharding over multiple instances is only possible within your application; Redis itself does nothing here
- You pay per instance regardless of the load or the number of requests
- If you want data redundancy, you need to set up replication yourself (not possible between different regions)
- You need to do the work yourself for high availability
- Note: Redis/memcached are in-memory stores and should generally be faster than DynamoDB
- Note: exposed as an API
Typically compared with AWS DynamoDB
In DynamoDB, data is partitioned automatically by its hash key. That’s why you will need to choose a hash key if you’re implementing a GSI. The partitioning logic depends upon two things: table size and throughput.
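The hash-key partitioning idea can be sketched as follows. DynamoDB's actual internal hash function is not public; MD5 below is purely illustrative, and the key names are hypothetical:

```python
import hashlib

# Sketch of hash-key partitioning: hash the item's hash-key attribute and
# map it to one of N partitions. DynamoDB's real hash function is internal;
# md5 here is purely illustrative.
def partition_for(hash_key, num_partitions):
    """Deterministically map a hash key to a partition index."""
    digest = hashlib.md5(hash_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

keys = ["user#1", "user#2", "user#3"]
print([partition_for(k, 4) for k in keys])
```

The important properties are that the mapping is deterministic (the same key always lands on the same partition) and that keys spread roughly evenly, which is why a high-cardinality hash key matters for throughput.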
DynamoDB supports the following data types:
- Scalar: Number, String, Binary, Boolean, and Null.
- Multi-valued: String Set, Number Set, and Binary Set.
- Document: List and Map.
DynamoDB's characteristics, relative to Redis Cache:
- Has a query language that can do more complex things (greater than, between, etc.)
- Is reachable via an external internet-facing API (different regions are reachable without any changes or your own infrastructure)
- Permissions based on tables or even rows are possible
- Scales in data size effectively without limit
- You pay per request: low request numbers mean a smaller bill, high request numbers a higher bill
- Reads and writes are priced differently
- Data is saved redundantly by AWS in multiple facilities
- Highly available out of the box
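An item using the scalar, set, and document types listed above can be written in DynamoDB's attribute-value JSON format. The table layout and attribute names here are illustrative:

```python
# A DynamoDB item in attribute-value JSON format, exercising the scalar,
# multi-valued, and document types listed above. Attribute names are
# hypothetical.
item = {
    "UserId": {"S": "user#1"},          # String scalar (the hash key)
    "LoginCount": {"N": "42"},          # Numbers are sent as strings
    "Active": {"BOOL": True},           # Boolean scalar
    "Tags": {"SS": ["admin", "beta"]},  # String Set (multi-valued)
    "Address": {"M": {                  # Map (document type)
        "City": {"S": "Seattle"},
    }},
    "Scores": {"L": [{"N": "1"}, {"N": "2"}]},  # List (document type)
}

print(sorted(item.keys()))
```

With boto3's low-level client, a dict of this shape is what `put_item` expects in its `Item` parameter.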
- Azure Document DB
Azure DocumentDB is a planet-scale, highly available NoSQL database service: Microsoft’s multi-tenant distributed database service for managing JSON documents at internet scale, now generally available to Azure developers. Its indexing subsystem enables automatic indexing of documents without requiring a schema or secondary indices and, uniquely, provides real-time consistent queries in the face of very high rates of document updates. As a multi-tenant service, DocumentDB is designed to operate within extremely frugal resource budgets while providing predictable performance and robust resource isolation to its tenants. Its capabilities include document representation, a query language, automatic document indexing, and core index support.
Azure DocumentDB guarantees less than 10 ms latencies on reads and less than 15 ms latencies on writes for at least 99% of requests. DocumentDB leverages a write-optimized, latch-free database engine designed for high-performance solid-state drives to run in the cloud at global scale. Read and write requests are always served from your local region while data can be distributed to other regions across the globe. Scale data throughput and storage independently and elastically—not just within one region, but across many geographically distributed regions. Add capacity to serve millions of requests per second at a fraction of the cost of other popular NoSQL databases.
Easily build apps at planet scale without the hassle of complex, multiple data center configurations. Designed as a globally distributed database system, DocumentDB automatically replicates all of your data to any number of regions worldwide. Your apps can serve data from a region closest to your users for fast, uninterrupted access.
Typically compared with AWS DynamoDB (see the DynamoDB characteristics above).
- Azure HDInsight
Easily spin up enterprise-grade, open source cluster types, guaranteed with the industry’s best 99.9% SLA and 24/7 support. The SLA covers your entire Azure big data solution, not just the virtual machine instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and a built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24/7 enterprise support backed by Microsoft and Hortonworks, with 37 combined committers for Hadoop core (more than all other managed cloud providers combined) ready to support your deployment with the ability to fix and commit code back to Hadoop.
Use rich productivity suites for Hadoop and Spark with your preferred development environment, such as Visual Studio, Eclipse, and IntelliJ, with support for Scala, Python, R, Java, and .NET. Data scientists gain the ability to combine code, statistical equations, and visualizations to tell a story about their data through integration with the two most popular notebooks, Jupyter and Zeppelin.
HDInsight is also the only managed cloud Hadoop solution with integration to Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data at up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.
A thriving market of independent software vendors (ISVs) provides value-added solutions across the broader Hadoop ecosystem. Because every cluster is extended with edge nodes and script actions, HDInsight lets you spin up Hadoop and Spark clusters that are pre-integrated and pre-tuned with ISV applications out of the box, including Datameer, Cask, AtScale, and StreamSets.
Integration with the Azure Load Balancer:
Azure Load Balancer delivers high availability and network performance to your applications. It is a Layer 4 (TCP, UDP) load balancer that distributes incoming traffic among healthy instances of services defined in a load-balanced set.
Azure Load Balancer can be configured to:
- Load balance incoming Internet traffic to virtual machines. This configuration is known as Internet-facing load balancing.
- Load balance traffic between virtual machines in a virtual network, between virtual machines in cloud services, or between on-premises computers and virtual machines in a cross-premises virtual network. This configuration is known as internal load balancing.
- Forward external traffic to a specific virtual machine.
All resources in the cloud need a public IP address to be reachable from the Internet. The cloud infrastructure in Azure uses non-routable IP addresses for its resources. Azure uses network address translation (NAT) with public IP addresses to communicate to the Internet.
- Note: this is not similar to the elastic load balancing used with EMR
- Shows live changes in analytics
- Azure itself is very user-friendly, and HDInsight is a great addition
- Works seamlessly in tandem with other Microsoft technologies, such as Power BI and Excel, as end clients
- When loading large volumes of data, issues in the data (data corruption errors or latency problems) might stop the load midway, and the entire load then needs to be restarted from the beginning
- Azure HDInsight is a service that provisions Apache Hadoop in the Azure cloud, providing a software framework designed to manage, analyze and report on big data.
- HDInsight clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices.
- Unlike the first edition of HDInsight, it is now delivered on Linux (as Hadoop should be), which means access to HDP features. The cluster can be accessed via Ambari in the web browser, or directly via SSH.
- HDInsight has always been an elastic platform for data processing. In today’s platform, it’s even more scalable. Not only can nodes be added and removed from a running cluster, but individual node size can be controlled which means the cluster can be highly optimized to run the specific jobs that are scheduled.
- In its initial form, there were many options for developing HDInsight processing jobs. Today, however, there are really great options available that enable developers to build data processing applications in whatever environment they prefer. For Windows developers, HDInsight has a rich plugin for Visual Studio that supports the creation of Hive, Pig, and Storm applications. For Linux or Windows developers, HDInsight has plugins for both IntelliJ IDEA and Eclipse, two very popular open-source Java IDE platforms. HDInsight also supports PowerShell, Bash, and Windows command inputs to allow for scripting of job workflows.
Typically Compared with AWS EMR
Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.
AWS EMR is more mature when compared to HDInsight however HDInsight development continues to progress at a rapid pace compared to the phased approach for AWS EMR.
AWS EMR pricing is also higher than Azure HDInsight in terms of storage usage.
EMR release velocity is better than Azure HDInsight.
- Amazon EMR provides a managed Hadoop framework that simplifies big data processing.
- Other popular distributed frameworks such as Apache Spark and Presto can also be run in Amazon EMR.
- Pricing of Amazon EMR is simple and predictable: you pay an hourly rate. A 10-node Hadoop cluster can be launched for as little as $0.15 per hour. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, 50-80% can also be saved on the cost of the underlying instances.
- It is also popular due to its ease of use. When a cluster is launched on Amazon EMR, the web service allocates the virtual server instances and configures them with the needed software for you. Within minutes you can have a cluster configured and ready to run your Hadoop application.
- It is resizable: the number of virtual server instances can easily be expanded or contracted depending on processing needs.
- Amazon EMR integrates with popular business intelligence (BI) tools such as Tableau, MicroStrategy, and Datameer. For more information, see Use Business Intelligence Tools with Amazon EMR.
- You can run Amazon EMR in a Amazon VPC in which you configure networking and security rules. Amazon EMR also supports IAM users and roles which you can use to control access to your cluster and permissions that restrict what others can do on the cluster. For more information, see Configure Access to the Cluster.
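The pricing bullet above can be made concrete with a small worked calculation. All numbers here are illustrative: the $0.15/hour figure is taken as the EMR charge for the 10-node example, the per-instance rate is hypothetical, and a 60% Spot saving is assumed:

```python
# Illustrative monthly cost for the 10-node EMR example. The $0.15/hour EMR
# charge comes from the bullet above; the $0.10/hour per-instance rate and
# the 60% Spot discount are assumptions, not quoted prices.
hours = 24 * 30                     # one month of continuous operation
emr_rate = 0.15                     # EMR charge for the 10-node cluster, $/hour
on_demand_instances = 10 * 0.10    # hypothetical EC2 cost, $/hour for 10 nodes
spot_discount = 0.60                # assumed saving from Spot Instances

emr_cost = emr_rate * hours                                   # 108.0
instance_cost = on_demand_instances * hours * (1 - spot_discount)  # 288.0
print(round(emr_cost + instance_cost, 2))  # 396.0
```

The point is the pricing shape, not the exact figures: the EMR surcharge is fixed per hour, while the instance cost is where Spot and Reserved Instance discounts apply.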