Day to day, our brain collects and processes a lot of information, so we can make decisions needed in our work and routine life. Like people do, companies also require the unit for storing and organizing the information; this tool is named a data warehouse.
Let’s take a deeper look at what the enterprise data warehouse is, how it works, and what features and types it has.
What hides under Enterprise Data Warehouse?
If you are interested in how much information сan be placed in the EDW, this fact is for you: the number of data US Netflix subs have used in the last month is 527 million terabytes. So, an EDW is not just a database.
EDW is a corporate repository that stores and manages all historical business data of the enterprise. This data goes from the CRM, ERP, flat files, and physical recordings, and it should be in the one storage for a versatile analysis. So, the enterprise can manage large data sets without several databases, just an EDW.
This kind of data storing is used in business intelligence, a scope of technologies and methods of transforming raw data into actionable insights.
How to distinguish an enterprise data warehouse from the usual data warehouse?
Every warehouse, first of all, is the database, connected with sources of raw data with the help of data integration tools and analytical interfaces. The warehouse is the place for data storing, transforming, and moving it to the end-user.
The main difference between the usual warehouse and enterprise warehouse is in its features and architectural diversity. EDW can include small databases that are more suitable for customer’s requests.
Let’s consider what functions an enterprise warehouse provides and with what models of data it works.
Enterprise Data Warehouse concepts
The pillars of an enterprise data warehouse as a technological phenomenon are its functions.
- EDW serves as the ultimate storage of all corporate databases.
- It demonstrates the data source. Data comes to EDW from CRMs, Google Analytics, IoT devices, etc. If data lies in different systems, it is impossible to manage it, so EDW should store the source data in one place. Moreover, new data is emerging inside and outside the company, so managing such kind of data needs special infrastructure.
- It stores structured data. Data from the EDW is standardized and structured, so end-users can request it via BI interfaces and create reports. This moment differs EDW from the data lake that stores unstructured data needed for analysis. Data lakes are used by data scientists or data engineers when they work with large sets of raw data.
- It structures data. Enterprise warehouse structures data around a particular object named data model. It is needed in cases when business data are related to different domains. Also, metadata can be used as an additional explanation of the data sources. For example, a sales region or total sales score of certain products can play the role of the subject or data model.
- It streamlines data. There is a historical kind of data in the warehouse: it describes past events. So, it is needed to streamline data by the time to see and understand the tendencies on the market.
- It provides lifetime storage. Data in the warehouse is never deleted from it as it is needed for analytics. Data can be updated or modified, but not erased by end-users. It can undergo common revisions once in a few years, when data engineers can delete irrelevant information.
Data warehouse classification
Data warehouses can differ for different kinds of business, data scope, analysis level, budget, security, etc. Also, you can always customize your data warehouse according to your company’s needs. Let’s consider the most popular data warehouse types.
Classic data warehouse
This is unified physical storage with its own specialized hardware and software. If you are using a classic ETL, you don’t need to implement integration tools between several databases. EDW can connect with data sources via API, so it allows ETL to get information and transform it without interruptions. So, in the role of place of actions can be the staging area (data transforming place before the loading to the data warehouse), or the warehouse itself.
The work with the classic data warehouse is not complicated by additional abstraction layers so that data scientists can manage data flow on the preprocessing side without problems.
The classic data warehouse has its disadvantages, as other data warehouse types do, but they can vary according to its realization in the company:
- Expensive hardware and software;
- Staff is needed (data engineers and DevOps) for data processing, setup, and maintenance of its technical infrastructure.
Where to use:
The classic data warehouse can be used in all kinds of enterprises that need data processing. This kind of data warehouse can transform into different architectural styles of data platforms and scale up and down.
Virtual data warehouse
This is an alternative solution that is used instead of a classic data warehouse. In simple words, it contains several databases connected virtually, so they work as a single system.
This approach is convenient for companies: data stays in their sources but can be pulled with the help of analytics tools. You can use a virtual data warehouse if you don’t want to work with basic infrastructure, or when your data is easy to manage.
Disadvantages of the virtual data warehouse:
- Data stored in the virtual data warehouse also requires software to transform it for reporting and end-users.
- Complex data queries can take a lot of time if needed fragments of data stand in different databases.
- Several databases require permanent software maintenance and costs.
Where to use:
It is a solution for companies whose raw data is in standard form and doesn’t require complex analytics. Also, it suits enterprises that just begin or use BI from time to time.
Cloud data warehouse
Cloud data warehouses have become a usual thing for enterprises in the last years. So, you can choose one of the most popular solutions of the warehouse-as-a-service on the market:
Snowflake suggests EDW as a standalone service, other providers allow companies to use fully-managed warehousing as part of their BI tools. If you choose a cloud data warehouse, you don’t need to build your physical servers, databases, and managing tools: its infrastructure is maintained for you. Such kind of warehouse’s pricing depends on the quantity of needed memory and computing capabilities for processing requests.
Disadvantages of the cloud data warehouse:
The main drawback of this solution is data security. While you should check your cloud warehouse vendor and be sure that you can trust this company. Business-data is a very sensitive thing, and it can be a risky moment when your information isn’t in your hands.
Where to use:
Cloud data warehouses can be a great solution for any size of enterprises. With a warehouse, you also get the integration of managing data, warehouse maintenance, and BI support.
The architecture of enterprise data warehouse
Let’s consider the most popular approaches to enterprise data warehouse architecture that improve warehouse capabilities. The data pipeline is divided into three layers:
- Data sources (layer with raw data);
- Warehouse with this ecosystem;
- Analytical tools (user interface).
Tools for extraction, transforming and data loading belong to the separate layer, named ETL. The ETL stage stands between the raw data layer and the moment when data comes to the warehouse.
Data also can be transformed at the moment of loading to the warehouse. So, the warehouse requires some functionality for cleaning, standardization, and dimensionalization; and these tasks can define the kind of warehouse or architecture. Let’s take a deeper look at how the requirements of the organization can influence the warehouse architecture.
One-tier architecture
One-tier architecture is the simplest type of warehouse. In this case, a warehouse is a relational database with modules for the usage of multidimensional data. Also, it can be a database that performs elementary actions and separates domain-specific information
to make easy access.
EDW with one-tier architecture means a database that is directly connected with analytical interfaces for end-users requests. Such kind of connection can be tricky:
- A database “transforms” into a warehouse when it stores more than 100GB of data. Direct connection with this gigantic scope of information leads to messy query results and low processing speed.
- Data request in case of using a warehouse with one-tier architecture takes a precise input. The system filters non-required data, and this process restricts the work of the presentation tools.
- Analytical features, reporting, and system flexibility also are limited.
One-tier architecture approach demonstrates slowness and unpredictability of work, so it doesn’t suit large-scale data platforms. It can be enhanced with low-level instances if it is needed to simplify data access and perform advanced data queries.
Two-tier architecture
A two-tier architecture is complemented with the data mart layer between the user interface and EDW. A data mart contains information related to the particular domain, so it is a small database, a part of EDW, with dedicated information for sales departments, marketing, etc.
A data mart implementation requires additional resources: establishing hardware and integration with another part of an architecture. These costs are reasonable for several moments:
- It simplifies access to the required data for the company’s departments: every data is related to a certain domain.
- Data marts restrict access to the data for end-users, and it increases the EDW security.
Three-tier architecture
The three-tier architecture approach allows companies to use in their work data online analytical processing (OLAP) cube. This type of database provides data from several dimensions. It makes it possible to compile data in several dimensions and move between these dimensions.
In simple words, the OLAP cube looks like several Excel tables connected with each other. Data in the OLAP cube is segmented in multiple ways at the same time, for example, by locations and periods of time, so it is called multidimensional data.
OLAP cube is a perfect solution for detailed reporting. With its help, customers can select and mix needed data. OLAP cubes optimized for work with warehouses; they can be used both with EDW and provide access to all corporate data scope. It is an available and common solution; there are no difficulties with implementation, as almost all warehouse vendors offer OLAP as a service.
Kinds of databases used in data architecture
Let’s take a deeper look at the fundamental difference between a data warehouse, a data mart, and a data lake.
- A data warehouse is used for storing structured data, and it can accommodate from 100GB to infinity. A data warehouse provides comprehensive results for end-users and querying tools.
- Data lakes are needed for storing raw or mixed data and are used for machine learning, large data scopes, and data mining. Also, data lakes exploited in BI processes as an alternative to the ETL process: raw data is loaded to the data lake and transformed. Data lakes can be excessively confused for reaching the structured data, its usage demonstrates both pros and cons.
- Data marts can be used instead of a data warehouse. In this case, the model allows using several data marts for spreading data by domains and connecting them to each other. Data marts can store a small quantity of information (less than 100GB), so they don’t suit enterprise requirements as enterprise data warehouses. Generally, data marts are used to segment large data warehouses into several more convenient ones.
Enterprise Data Warehouse elements
Let’s list all components that constitute a data warehouse platform and build some kind of EDW glossary.
DW database: as a rule, data warehouses are relational databases, they contain a database management system and additional storage for metadata.
ETL (extraction, transforming, and loading) layer: tools that fulfill an actual connection with the source data, its extraction, and loading to the place for transforming. It is essential to distinguish ETL from the ELT approach. The ELT means data loading before transforming and transforming performs in the warehouse.
Meta-data module: meta-data are explanations and hints for users or administrators about data subject or domain. It can be technical meta-data, for example, the initial source, or business meta-data, for instance, a region of sales. So, all meta-data stores in a separate module that is managed by the meta-data manager.
Reporting level or BI-interface: tools that provide access to the data for end-users. It is needed for data visualization, building reports, and extraction of particular parts of data.
Sources: databases that store raw data.
Staging area: it is a place where data is loading before going to the EDW. In the staging area, data is cleared and transformed into the needed data model. There also can be tools for data quality management.
The planning of the EDW deployment is a complicated and time-consuming process: you should figure out with data warehouse construction, its tools, features, your business requirements, and end-users needs. While, it will be a good solution to turn to the experienced specialists in warehousing, BI, and ETL. They will help you to choose the most suitable technical solution that meets your company’s size and specialization.