Data has become paramount in the information age. An infamous example of how obtaining the right data and using it can impact outcomes is Cambridge Analytica. The now-dismantled firm used Facebook data to influence public opinion, and with it the voting patterns of electorates in several nations, including the United States and the United Kingdom. As a result, data is now considered a commodity worth more than most others, including oil.
Tactical or strategic decisions based on data analytics may seem to have only marginal influence, yet the difference they make can be significant. In the US, Cambridge Analytica targeted swing voters, and fewer than 80,000 votes decided the result. Equally small margins matter in business: studies have shown, for example, that a 5% increase in customer retention rates can produce a 25% to 95% increase in profits.
As most businesses go digital, they gain access to all kinds of data from their customers through apps, social media, and sign-up forms. ETL tools are software programs that help a business extract this data, which then serves as a source of information for business strategy.
Business Intelligence (BI) refers to the infrastructure for collecting, storing, and analyzing the data generated by a business’s operational activities in order to improve decision-making. A more recent definition identifies BI as “concepts and methods to improve business decision-making by using fact-based support systems”.
Yet the data a business collects often comes in different forms. The ETL process extracts data from different sources, in different formats, and stores it in data warehouses: essentially “subject-oriented, integrated, time-variant and non-volatile collections of data in support of management’s decision-making process”.
Counter-intuitive as it may seem, given the current state of the world economy, the collection of business data predates computers; in the past, business data was kept in hard copies. The emergence of relational database management systems in the ’70s and the development of better techniques led to better methods of acquiring BI, and the modern concepts of business intelligence grew out of mainframes, widespread personal computing, spreadsheets, and Structured Query Language (SQL). Reports that would have taken weeks to compile without these techniques could now be produced much faster, thanks to the availability of a wide range of data from different sources in data warehouses.
The first era of Business Intelligence (BI 1.0) was characterized by the ability of businesses to aggregate data and create reports, with most of the effort centered on data integration, data quality, cleansing, and other aggregation functions. Creating reports was slow compared to modern techniques, and the waiting time between a report request and its delivery would be a disadvantage in the current business environment. BI 2.0 introduced new functionality: near real-time processing, collaboration, self-service report generation, and ETL tools.
The current market is increasingly connected and competitive, with consumers who expect instant results, and businesses are now designed to be more flexible and responsive to changes in it. With data increasingly accessible, easier to store, and significant in generating competitive advantages, the tools a business has for processing data can easily determine its success or failure.
ETL processes can be described as the backbone of data warehousing: they determine the accuracy and usefulness of both the data warehouse and the information derived from it. The process, as the name suggests, is split into three phases: extract, transform, and load.
Extraction is the stage where data is pulled from its sources in whatever format it is available. Different strategies can be used to implement it, the most common being full extraction, which re-reads everything on each run, and incremental extraction, which picks up only the records that have changed since the last run.
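To make this concrete, here is a minimal sketch of both strategies, assuming the sources are a CSV export and a SQLite operational database. The file name, the orders table, and its updated_at column are hypothetical stand-ins for whatever a real source exposes.

```python
import csv
import sqlite3

def extract_from_csv(path):
    """Full extraction from a flat-file source, e.g. a sign-up form export."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def extract_incremental(db_path, last_run):
    """Incremental extraction: only rows changed since the previous run."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows addressable by column name
    try:
        rows = conn.execute(
            "SELECT * FROM orders WHERE updated_at > ?", (last_run,)
        ).fetchall()
        return [dict(row) for row in rows]
    finally:
        conn.close()
```

Full extraction is simpler but re-reads everything; incremental extraction needs a reliable change marker in the source, such as an update timestamp.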
Transformation is the simplest of the processes to explain, yet perhaps the most difficult to execute. As mentioned in an earlier section, the data used by organizations comes from different sources. Some sources may use the imperial system of measurement where others rely on the metric system, or record amounts in euros where the benchmark currency is the US dollar. The differences are not limited to units of measurement but extend to formats as well: a date such as the 4th of July 2012 can be represented as 04/07/12, 07/04/12, or 12/07/04, depending on the convention used to record it. If a data warehouse contains a significant portion of information with this type of error, the reliability of the information derived from it drops substantially, leading to inaccurate conclusions and sub-optimal decisions.
Transformation typically happens in a ‘staging area’, where data is checked for quality and consistency while being transformed to make it compatible with the data warehouse.
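As an illustration, the following is a minimal staging-area transform for one hypothetical source that records dates as DD/MM/YY, amounts in euros, and weights in pounds. The field names are invented, and the EUR-to-USD rate is a placeholder a real pipeline would look up.

```python
from datetime import datetime

EUR_TO_USD = 1.08  # placeholder rate; a real pipeline would fetch this

def transform(record):
    """Normalize one raw record to the warehouse's conventions."""
    # Each source must declare its date format; without that, a value
    # like 04/07/12 stays ambiguous. This source always writes DD/MM/YY.
    date = datetime.strptime(record["order_date"], "%d/%m/%y").date()
    # Convert currency to the warehouse benchmark (US dollars).
    amount_usd = round(float(record["amount_eur"]) * EUR_TO_USD, 2)
    # Convert imperial weight to metric.
    weight_kg = float(record["weight_lb"]) * 0.45359237
    return {
        "order_date": date.isoformat(),  # always YYYY-MM-DD in the warehouse
        "amount_usd": amount_usd,
        "weight_kg": round(weight_kg, 3),
    }

print(transform({"order_date": "04/07/12", "amount_eur": "19.99", "weight_lb": "2.5"}))
# {'order_date': '2012-07-04', 'amount_usd': 21.59, 'weight_kg': 1.134}
```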
Loading is the process of integrating the data produced by the transform phase into the data warehouse. Transformation requires time, and in the past technology significantly limited data loading speeds as well, which required the development of different strategies to load data.
Batch loading uploads all newly acquired data into an existing data warehouse while the warehouse itself is offline, often during off-hours. This also provides a window for performance optimization to keep the warehouse running smoothly. The drawback of this strategy is the need to balance the availability of the most recent information against data warehouse downtime: frequent updates provide more recent data but also the most interruptions.
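A batch load, sketched here with SQLite standing in for the warehouse, might look like this; the sales table and its columns are hypothetical and carry over from the transform example above.

```python
import sqlite3

def batch_load(db_path, rows):
    """Load a full batch of transformed rows in one off-hours transaction."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # one atomic transaction: all rows land, or none do
            conn.executemany(
                "INSERT INTO sales (order_date, amount_usd, weight_kg) "
                "VALUES (?, ?, ?)",
                [(r["order_date"], r["amount_usd"], r["weight_kg"]) for r in rows],
            )
        # The same maintenance window can host optimization work,
        # e.g. refreshing the query planner's statistics.
        conn.execute("ANALYZE")
    finally:
        conn.close()
```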
Near real-time updating of the data warehouse, or data streaming, ensures the warehouse receives new data as it becomes available. This offers a strategic advantage to organizations that require flexibility and quick responses to changes in the market. However, continuously updating the warehouse can present performance issues because of processing power and data bandwidth requirements, although advancements in processors and bandwidth have eased some of these concerns.
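For contrast, a simplified near real-time loader might poll a feed and apply small, frequent transactions. The event_source callable here is a stand-in for whatever supplies new rows in practice, such as a message queue or a change-data-capture feed.

```python
import sqlite3
import time

def stream_load(db_path, event_source, poll_seconds=5):
    """Append rows to the warehouse as they arrive (near real-time)."""
    conn = sqlite3.connect(db_path)
    try:
        while True:
            new_rows = event_source()  # rows that appeared since the last poll
            if new_rows:
                with conn:  # many small transactions instead of one nightly batch
                    conn.executemany(
                        "INSERT INTO sales (order_date, amount_usd, weight_kg) "
                        "VALUES (?, ?, ?)",
                        new_rows,
                    )
            time.sleep(poll_seconds)  # freshness vs. load on the warehouse
    finally:
        conn.close()
```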
However, there may still be an advantage to choosing batch processing over data streaming. When acting on complete rather than partial data leads to different outcomes, batching may be the better option. A crude example is feedback on a product or service that reaches different markets at different times: early adopters may give feedback that differs significantly from the general market, because that segment is not a fair representation of the overall market. An organization reacting quickly to the responses of such a biased sample could make decisions that adversely impact the rest of the market.
The loading phase does not have to be the final step in the data warehousing operations of Business Intelligence. ELT (Extract, Load, Transform) loads the collected data into the data warehouse immediately and transforms it at a later stage.
Although this method is intuitively less reliable than ETL, due to possible inconsistencies in the quality of the data, powerful tools and methods are making it increasingly useful to businesses. ETL not only improves the accuracy of the data but may also provide compliance and privacy advantages; ELT, on the other hand, is quicker and more flexible in the data it can process, though it is more complex and requires more storage space than the ETL process.
The advantages of ELT, when implemented successfully, include shorter waits before data becomes available, lower maintenance requirements, and a broader scope of available data. Further, the availability of cloud-based ETL tools has opened up a wide range of possibilities and options, which will be explored in the next section.
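To illustrate the difference, here is a minimal ELT sketch, again with SQLite standing in for the warehouse: raw records land first, untouched, and the transformation runs later as SQL inside the warehouse itself. The table names and the currency rate are placeholders.

```python
import sqlite3

def elt(db_path, raw_rows):
    """ELT: land raw records first, transform later inside the warehouse."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            # Load: persist the data exactly as extracted, no cleansing yet.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS raw_sales "
                "(order_date TEXT, amount_eur TEXT)"
            )
            conn.executemany(
                "INSERT INTO raw_sales VALUES (?, ?)",
                [(r["order_date"], r["amount_eur"]) for r in raw_rows],
            )
        with conn:
            # Transform: run whenever the rules are settled, using the
            # warehouse's own engine. 1.08 is a placeholder EUR->USD rate.
            conn.execute("DROP TABLE IF EXISTS sales_usd")
            conn.execute(
                """
                CREATE TABLE sales_usd AS
                SELECT order_date,
                       ROUND(CAST(amount_eur AS REAL) * 1.08, 2) AS amount_usd
                FROM raw_sales
                """
            )
    finally:
        conn.close()
```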
As previously mentioned, Business Intelligence can provide a competitive advantage and aid a company’s profitability. ETL tools play a key role in obtaining relevant information from data, but there are many factors to consider when choosing one.
There are several ETL tools available. The established tools, including those created by IBM, Microsoft, and Oracle, are mostly designed to work in batch and were not built with cloud-based computing in mind. Talend and Stitch are open-source and cloud-based with varying degrees of functionality, while Xplenty is one of the solutions offering both ETL and ELT functionality among other features. Solutions such as IronFocus take a different approach, determining the optimal tools suited to your business.
Regardless of the business model implemented, choosing the right ETL tool is a key strategic decision: its level of complexity, functionality, and cost should match business requirements as well as organizational goals. Data analytics, and the information extracted through the right tool, can be applied throughout the whole sales funnel.
For example, data solutions could be used to target marketing changes that reduce the cost per lead and cost per acquisition experienced by the business, improve sales effectiveness and conversion rates, and enhance customer retention, which, as mentioned earlier, can be a key driver of profit.