Did you know that Google uses a technology that can scan billions of rows in tens of seconds? Remarkably, this scan is done without any indexing. Sounds amazing, right? Let us take a closer look at this technology, called “Dremel”.
Dremel can execute SQL-like queries over large datasets and return accurate results within seconds; it can scan around 20 TB of data in just tens of seconds. So, it is safe to say that anyone with a basic knowledge of SQL can query very large data sets efficiently.
What is BigQuery?
BigQuery is a public service that businesses and developers can use. It exposes the core features of Dremel: Dremel is used internally at Google, while BigQuery is the version offered to the public. By core features, we mean two core technologies:
- Columnar storage
- Tree architecture
Columnar storage: In general, databases store data in the form of rows, but here the data is stored by column instead. For example, suppose we have a table with ID and Name as columns, holding two rows: row 1 is (1, A) and row 2 is (2, B). In columnar storage, each record is split into its column values, and the values of each column are stored together, separately from the other columns, on their own storage volume.
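The layout difference can be sketched in a few lines of plain Python (this is only an illustration of the idea, not BigQuery's actual storage engine):

```python
# The same two-row table from the example above, stored two ways.
rows = [(1, "A"), (2, "B")]          # row-oriented: whole records kept together

columns = {                          # column-oriented: all values of one column
    "ID":   [r[0] for r in rows],    # stored contiguously
    "Name": [r[1] for r in rows],
}

# A scan that touches only the ID column never has to read the Name
# values at all, which is why columnar layouts speed up analytic queries.
ids = columns["ID"]
print(ids)   # [1, 2]
```

This selective reading is the key benefit: an analytic query usually touches a handful of columns out of many, so a columnar engine reads far less data per query.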
Tree architecture: The main challenge Google faced was dispatching queries across tens of thousands of machines and gathering the results in a matter of seconds. Their solution was the “tree architecture”: divide a query and distribute it down a massively parallel tree of servers, then collect and merge the partial results back up from the leaves of the tree at great speed.
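A toy model of this divide-and-merge pattern, with a single root fanning a sum out to leaf workers over data shards (the function names and the two-level shape are illustrative assumptions, not Dremel's real topology):

```python
# Hypothetical two-level fan-out: the root splits an aggregation across
# "leaf" workers, each scanning only its own shard, then merges the
# partial results it gets back.
def leaf_scan(shard):
    # each leaf aggregates its local slice of the data
    return sum(shard)

def root_query(data, num_leaves=4):
    # distribute the data round-robin across the leaves
    shards = [data[i::num_leaves] for i in range(num_leaves)]
    partials = [leaf_scan(s) for s in shards]   # leaves work independently
    return sum(partials)                        # merge at the root

print(root_query(list(range(1, 101))))   # 5050 — same answer as one big scan
```

In the real system the leaves run on separate machines in parallel and there are intermediate levels between root and leaves, but the shape of the computation — split at the top, scan at the bottom, merge on the way up — is the same.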
BigQuery vs. Other Cloud Services:
Let us compare BigQuery with some of the most popular cloud services like Amazon Redshift and Microsoft Azure.
Amazon Redshift: Amazon Redshift is built around nodes. These nodes must be configured, managed and deployed much like an on-premise cluster: although there is no physical hardware to maintain, the cluster still needs hands-on administration. Before deploying, we have to plan clearly what cluster size and configuration are required. Everything is accessed through a leader node, and scheduled maintenance updates are processed many times throughout the year. The pricing is flexible, but the lowest rates require prepaid reserved capacity, paid for up to three years in advance.
Microsoft Azure SQL Data Warehouse: Microsoft Azure SQL Data Warehouse is a cloud-based Enterprise Data Warehouse (EDW) solution. It processes SQL queries against multiple databases simultaneously using an MPP (Massively Parallel Processing) architecture: a control node redirects queries in parallel to the compute nodes and gathers their results back, scaling dynamically as required. None of these resources are free; they are sold at pre-defined service levels measured in Data Warehouse Units (DWU).
BigQuery: Both of the above services require constant monitoring, performance tuning and node configuration, none of which is needed in BigQuery. This is the main thing “fully managed service” means in practice. From an ETL, query and user-administration standpoint, BigQuery is easy to manage: it provisions and manages the underlying virtual resources itself, yielding significant administrative savings, and it also supports Artificial Intelligence and Machine Learning integrations.
Benefits of Using Google BigQuery:
Fast: We can execute ad hoc queries on massive data sets (terabytes) and get results back in seconds. Organizations can analyse all this data without any investment in hardware or complex software.
Easy to analyse: Queries over data sets with billions of rows return in seconds, and they are written in basic SQL-style queries. Analysing the data is therefore simple and easy, done through a straightforward UI or REST interface.
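To give a feel for the kind of plain SQL involved, here is a sketch using Python's built-in sqlite3 module as a stand-in (BigQuery itself requires a Google Cloud project and credentials; the table and column names here are made up for illustration):

```python
import sqlite3

# In-memory database standing in for a BigQuery dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# An everyday analytic aggregation — ordinary SQL, nothing exotic.
total_per_user = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(total_per_user)   # [(1, 15.0), (2, 7.5)]
```

The same GROUP BY/SUM style of query is what you would submit to BigQuery, just over billions of rows instead of three.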
Reliable & secure: We can replicate the same environment across multiple sites. If any one environment goes down, the data can still be accessed from the other locations through replication.
Scalable: We can scale through terabytes of data, and the cost depends on how much data you actually use. Each project is associated with one billing account, but multiple projects can be billed to the same account. There is no need to worry about the volume of data involved, because query pricing is based on the data a query actually processes, not on the total amount of data stored.
As data grows day by day, maintaining it securely is a challenge for many organizations, and most prefer to move it to the cloud. Many organizations also need to produce quality business intelligence to handle new challenges and opportunities, with the flexibility of other data warehouse services but at the lowest cost.
Google BigQuery addresses these challenges by removing the burden of managing, monitoring and maintaining the data warehouse infrastructure, letting teams focus on obtaining business intelligence rather than on building and maintaining the data-processing machinery. We would therefore strongly recommend Google BigQuery across the entire organization, both to reduce cost and to meet daily challenges effectively as business requirements change.
Contact for further details
Sr. Technology Specialist – Analytics Big Data