June 10, 2016

Government Agency Data Analytics Platform

Government agencies know that harnessing analytics technologies to transform data into citizen insights is crucial to improving the efficiency and effectiveness of their services. The HPE Data Analytics Blueprint discussed in this paper provides a good reference on how to start.

Every day, thousands of government clients, such as citizens and businesses, interact with government agencies via Internet-based e-Services, through call centers or by walking into government offices. These interactions generate a large amount of data that, when harnessed, can provide valuable client insights for the agencies. Some of the useful questions that government agencies can answer include:

  1. How effective are the policies and their implementation? Have the key performance indicators been met?
  2. What is the experience of clients interacting with agencies across different channels such as self-service, call center, online, or walk-in? Where are the areas of improvement?
  3. What is the citizens’ feedback on policies and agencies’ services? What are their views shared on social media platforms? How can agencies better meet clients’ expectations?
  4. How can management reports be made available to decision makers in near real-time?
  5. Can the current services handle the future growth in client population? Where are the potential bottlenecks?
  6. How can policies be adjusted to better serve clients? Where are the clients’ potential areas of need?
  7. How should data be shared across agencies so as to streamline the clients’ interactions with government?

Data Analytics Reference Architecture

To get to the answers, government agencies need a data analytics platform that consolidates data from relevant agencies, correlates the data to derive meaningful information and leverages advanced visualization tools to draw insights from the information. These insights can then be analyzed within the agency or shared with other government agencies.

The diagram below describes the key components of the Data Analytics Reference Architecture and how these components interact to collect, process and present data:

[Figure: Data Analytics Reference Architecture]

Starting from the left panels of the diagram, raw data is first collected in the form of data files from various sources and deposited on the Data Analytics Platform’s landing server via Secure File Transfer. Raw data usually comes in a variety of formats:

  • Transactional data – includes structured data such as transaction logs and electronic records collected from computer systems that service clients.
  • Web and social media data – includes clickstreams and interactive data from social media platforms such as Facebook, Twitter, LinkedIn and blogs.
  • Biometric data – includes the client’s biometric data such as fingerprints, genetics, handwriting, iris, facial or other biometric identifiers.
  • Human-generated data – includes unstructured or semi-structured data such as call center agents’ notes, voice recordings, emails, paper documents, surveys, and images.
  • Machine data – includes readings from sensors, meters and other devices from the Internet of Things (IoT).

For structured data such as transactional data, the data files are usually text files with pre-defined formats. In the case of unstructured data (also known as big data), the files come in a wider variety of formats, hold a much larger volume of records and grow at high velocity. These characteristics imply that more computation power and storage capacity should be allocated to process unstructured data.
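
Before any of this processing begins, the raw files have to reach the landing server via Secure File Transfer, as described above. A minimal sketch of that step using the Python paramiko library follows; the host name, credentials and paths are hypothetical placeholders.

```python
# Minimal sketch: deposit a raw data file on the landing server via SFTP.
# Host, credentials and paths are hypothetical placeholders.
import paramiko

LANDING_HOST = "landing.agency.example.gov"   # hypothetical landing server
LANDING_DIR = "/landing/incoming"             # hypothetical drop directory

def upload_to_landing(local_path: str, remote_name: str) -> None:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(LANDING_HOST, username="etl_user", key_filename="/etc/keys/etl_id_rsa")
    try:
        sftp = ssh.open_sftp()
        sftp.put(local_path, f"{LANDING_DIR}/{remote_name}")
        sftp.close()
    finally:
        ssh.close()

if __name__ == "__main__":
    upload_to_landing("transactions_20160610.csv", "transactions_20160610.csv")
```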

Data Extraction and Loading Process – The next step involves extracting data records from the landing server’s data files and loading them into the Staging Data Stores. Data files are scanned for malicious code or viruses, and integrity checks are performed to ensure that there is no missing data or unauthorized tampering. Depending on whether the data is structured or unstructured, it is loaded into either the structured or the unstructured data store.
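
The integrity check can be as simple as comparing each landed file against a digest published by the source system. A minimal sketch of such a check, assuming SHA-256 digests are supplied alongside the data files (file names are hypothetical):

```python
# Minimal sketch: verify a landed data file against a SHA-256 digest
# supplied by the source system before loading it into the staging store.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(data_file: str, digest_file: str) -> bool:
    expected = open(digest_file).read().split()[0].strip().lower()
    return sha256_of(data_file) == expected

if __name__ == "__main__":
    ok = verify("transactions_20160610.csv", "transactions_20160610.csv.sha256")
    print("integrity check passed" if ok else "file rejected: checksum mismatch")
```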

Structured data store – Structured data refers to data with known entity types and relationships. The most efficient approach for structured data analysis is to load the structured data into a relational database (RDBMS) so that the data can be queried via SQL commands. Depending on the data quality, a data cleansing exercise may be required before the data is loaded into the Staging Database.
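
A minimal sketch of this load-and-query approach, using SQLite purely for illustration (table and column names are hypothetical, and a production platform would use an enterprise RDBMS):

```python
# Minimal sketch: load structured transaction records into a relational
# staging table and query them with SQL. SQLite is used for illustration only.
import csv
import sqlite3

conn = sqlite3.connect("staging.db")
conn.execute("""CREATE TABLE IF NOT EXISTS txn_staging (
                    txn_id TEXT, channel TEXT, service TEXT, txn_date TEXT)""")

with open("transactions_20160610.csv", newline="") as f:
    rows = [(r["txn_id"], r["channel"], r["service"], r["txn_date"])
            for r in csv.DictReader(f)]
conn.executemany("INSERT INTO txn_staging VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Example analytical query: transaction volume per service channel.
for channel, count in conn.execute(
        "SELECT channel, COUNT(*) FROM txn_staging GROUP BY channel"):
    print(channel, count)
conn.close()
```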

Unstructured data store – Unstructured data such as human-generated and machine data is ill-suited to traditional relational databases, which require the data format to be known before storing. Therefore, unstructured data is ideally stored in NoSQL data stores (also called data lakes) that hold data as name-value pairs, graphs or documents. Currently, the most popular unstructured data store is the Hadoop Distributed File System (HDFS). HDFS is a massively scalable, highly available, distributed file system that allows large files (gigabytes to terabytes) to be stored across multiple Hadoop nodes on the network.
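
To make the HDFS step concrete, the sketch below stages a raw file into HDFS using the standard hdfs dfs command-line client invoked from Python; the cluster paths and file names are hypothetical placeholders.

```python
# Minimal sketch: copy a raw data file into HDFS using the standard
# `hdfs dfs` command-line client. Paths are hypothetical placeholders.
import subprocess

LOCAL_FILE = "call_recordings_20160610.tar"
HDFS_DIR = "/data/raw/call_center/2016/06/10"

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, HDFS_DIR], check=True)

# Confirm the file is now stored across the Hadoop data nodes.
subprocess.run(["hdfs", "dfs", "-ls", HDFS_DIR], check=True)
```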

Data Transformation Process – Raw structured data usually comes in different formats. It has to be transformed into a common format before it can be aggregated and loaded into the Operational Data Store (ODS). The next step therefore involves transforming data from the Staging Data Store into the ODS using Extract-Transform-Load (ETL) tools. For structured data, the source and destination data formats are well known and transformation scripts can be defined within the ETL tools. The ETL tools execute the defined logic against all input data, highlight any data exceptions and load the transformed data into the ODS. For unstructured data, traditional ETL tools are not useful since there is no data structure to begin with. There are several approaches to unstructured data transformation. The default method is to write Map-Reduce programs that execute ‘map’ tasks in parallel before consolidating the results via ‘reduce’ tasks. However, for more complex, multi-stage transformations, such as iterative graph or machine learning algorithms and interactive ad-hoc queries, the Apache Spark framework, which leverages distributed in-memory processing, offers much better performance than Map-Reduce.
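
As an illustration of the Spark approach, the PySpark sketch below runs a simple map-and-reduce style transformation over unstructured call-center notes; the paths and keyword list are hypothetical, and a real transformation would be considerably richer.

```python
# Minimal PySpark sketch: a simple map/reduce-style transformation over
# unstructured call-center notes stored in HDFS. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notes-transform").getOrCreate()

notes = spark.sparkContext.textFile("hdfs:///data/raw/call_center/notes/*.txt")

# 'map' phase: emit (keyword, 1) pairs; 'reduce' phase: sum the counts.
keyword_counts = (notes.flatMap(lambda line: line.lower().split())
                       .filter(lambda w: w in {"passport", "tax", "licence", "permit"})
                       .map(lambda w: (w, 1))
                       .reduceByKey(lambda a, b: a + b))

# Write the aggregated result back to HDFS for loading into the ODS.
keyword_counts.toDF(["keyword", "mentions"]) \
              .write.mode("overwrite").parquet("hdfs:///data/ods/keyword_mentions")

spark.stop()
```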

Intelligent Data Operating Layer (IDOL) – Intelligent Data Operating Layer tools can intelligently extract conceptual and contextual relevance from unstructured data. Through entity extraction and natural language processing algorithms, an IDOL such as HPE Autonomy IDOL can derive meaning from human-generated data. IDOL also provides useful out-of-the-box machine learning features such as optical character recognition, facial detection, text analytics and speech recognition to process image, video, sound and text data without the need for data scientists to write complex data transformation programs.

Operational Data Store (ODS) – An ODS stores the recent operational data consolidated from disparate sources so that analysis and reporting can be carried out while business operations are occurring. While in the ODS, data can be scrubbed, resolved for redundancy and checked for compliance with the corresponding business rules. The information in the ODS can also be shared with other agencies via API gateways.
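
A minimal sketch of the scrubbing and redundancy-resolution step using pandas; the column names and rules are illustrative assumptions, not the platform's actual business rules.

```python
# Minimal sketch: scrub and de-duplicate staged records before they are
# published in the Operational Data Store. Column names are hypothetical.
import pandas as pd

staged = pd.read_csv("staging_export.csv")

# Business-rule checks (illustrative): drop records missing a client ID,
# normalise channel codes, and remove duplicate transaction IDs.
clean = (staged.dropna(subset=["client_id"])
               .assign(channel=lambda df: df["channel"].str.strip().str.lower())
               .drop_duplicates(subset=["txn_id"], keep="first"))

rejected = len(staged) - len(clean)
print(f"{len(clean)} records accepted, {rejected} rejected by ODS rules")
clean.to_csv("ods_load.csv", index=False)
```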

Enterprise Data Warehouse – All historical structured data is consolidated in the enterprise data warehouse, which stores terabytes of data and supports complex queries that draw conclusions from it. Examples of complex queries include trend and predictive data analysis.

High Performance Analytic Database – In practice, traditional row-oriented relational databases usually perform poorly when database queries are complex; as a result, it can take far too long to generate analytical reports. To improve performance, computer scientists have restructured the way data is stored within the database, aligning the data layout to the column entities. These column-oriented databases boost the performance of read-intensive operations on large data repositories. Another way of improving performance is to store data in random-access memory (RAM) rather than on disk to boost data access speed; this category of databases is known as ‘in-memory databases’.
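
The pure-Python sketch below illustrates why column orientation helps: an analytical query that aggregates a single attribute touches one contiguous column array instead of scanning every row record. The records shown are made-up sample data.

```python
# Illustrative sketch: the same records held in a row-oriented layout and
# a column-oriented layout. An aggregate over one attribute touches every
# row object in the first case but only one array in the second.
rows = [
    {"txn_id": 1, "channel": "online",  "duration_sec": 120},
    {"txn_id": 2, "channel": "walk-in", "duration_sec": 540},
    {"txn_id": 3, "channel": "online",  "duration_sec": 95},
]

columns = {
    "txn_id":       [1, 2, 3],
    "channel":      ["online", "walk-in", "online"],
    "duration_sec": [120, 540, 95],
}

# Row store: scan every record, pulling out the one field we need.
avg_row = sum(r["duration_sec"] for r in rows) / len(rows)

# Column store: the field is already a contiguous array, ideal for
# read-intensive analytics (and for keeping hot columns in RAM).
col = columns["duration_sec"]
avg_col = sum(col) / len(col)

assert avg_row == avg_col
print(f"average handling time: {avg_col:.1f} seconds")
```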

Basic Data Visualization Tools – Data visualization tools present data to users via graphs, charts or other graphical artifacts. Leveraging data visualization tools, users can interactively query data warehouses and generate visually appealing, customized reports.
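
Such tools are interactive, but the kind of chart they produce can be sketched in a few lines; the matplotlib example below renders a simple channel-volume chart from made-up sample data.

```python
# Illustrative sketch: a simple chart of the kind a visualization tool
# would generate interactively. The numbers are made-up sample data.
import matplotlib.pyplot as plt

channels = ["e-Services", "Call center", "Walk-in"]
interactions = [52000, 18500, 9400]   # hypothetical monthly volumes

plt.bar(channels, interactions)
plt.title("Client interactions by channel (sample data)")
plt.ylabel("Interactions per month")
plt.tight_layout()
plt.savefig("interactions_by_channel.png")
```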

Data Link Analysis Tools – For investigations into fraud and money laundering, the iterative evaluation of relationships (connections) between data entities enables the discovery of unusual patterns. Data link analysis tools enable investigators to visually find matches in the data for known patterns of interest, spot abnormalities where known patterns are violated and discover new patterns of interest.
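
A minimal sketch of link analysis using the networkx library, flagging attributes that are shared by an unusually large number of accounts; the entities and threshold are hypothetical examples.

```python
# Minimal sketch: link analysis over entity relationships to surface
# unusual connection patterns. Data and threshold are hypothetical.
import networkx as nx

G = nx.Graph()
# Edges link client accounts to shared attributes (address, phone, device).
edges = [
    ("acct_001", "addr_17"), ("acct_002", "addr_17"), ("acct_003", "addr_17"),
    ("acct_004", "addr_17"), ("acct_005", "phone_9"), ("acct_006", "phone_9"),
]
G.add_edges_from(edges)

# An attribute shared by many otherwise unrelated accounts is a pattern
# an investigator may want to examine further.
SHARED_THRESHOLD = 3
for node, degree in G.degree():
    if not node.startswith("acct_") and degree >= SHARED_THRESHOLD:
        linked = sorted(G.neighbors(node))
        print(f"{node} is shared by {degree} accounts: {linked}")
```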

Advanced Analytics & Reporting Tools – Advanced analytics and reporting tools are a grouping of analytic techniques used to predict future outcomes. Applying statistical models to historical data stored in the data warehouse uncovers trends that can be used to predict future results.
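
A minimal sketch of the predictive step, fitting a simple regression to historical monthly transaction volumes drawn from the warehouse and projecting the next quarter; the figures are illustrative sample data, and a production model would be far more sophisticated.

```python
# Minimal sketch: fit a simple trend model to historical monthly volumes
# drawn from the data warehouse and project the next quarter. The figures
# are illustrative sample data, not real agency statistics.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)            # Jan..Dec of last year
volumes = np.array([41, 43, 44, 47, 46, 49, 51, 50, 53, 55, 56, 58]) * 1000

model = LinearRegression().fit(months, volumes)

future = np.arange(13, 16).reshape(-1, 1)            # next three months
for m, forecast in zip(future.ravel(), model.predict(future)):
    print(f"month {m}: forecast {forecast:,.0f} transactions")
```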

Online and Batch Data Requests – Inter-agency sharing of data insights is usually done in two ways: interactively through an Application Program Interface (API) Gateway, or in batch mode via data file transfers. The API Gateway is a microservices gateway that exposes client-specific APIs to external agencies’ applications. It manages the access controls, protocol translations and service invocations between the client-specific APIs and internal microservice providers such as the ODS or other internal systems.
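
A minimal sketch of the interactive path, showing an external agency application calling a hypothetical client-interactions API exposed through the gateway; the URL, token and response fields are placeholders, not a real endpoint.

```python
# Minimal sketch: an external agency application calling a client-record
# API exposed through the API Gateway. URL, token and fields are
# hypothetical placeholders, not a real government endpoint.
import requests

GATEWAY_URL = "https://api.agency.example.gov/v1/clients/{client_id}/interactions"
ACCESS_TOKEN = "replace-with-oauth-token"            # issued by the gateway

def fetch_interactions(client_id: str) -> list:
    response = requests.get(
        GATEWAY_URL.format(client_id=client_id),
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}",
                 "Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("interactions", [])

if __name__ == "__main__":
    for item in fetch_interactions("C0012345"):
        print(item)
```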

HPE HAVEN Platform

The Hewlett Packard Enterprise HAVEN Platform is a proven Big Data Platform that government agencies can leverage to realize the above Data Analytics Reference Architecture.

The HPE HAVEN Platform brings together a set of core HPE technologies to help organizations make the business transformation to connected intelligence and analytics-driven decision making. The technology stack comprises:

  • Hadoop – Apache Hadoop is the leading open source, NoSQL software platform for storage and processing of Big Data. HPE HAVEN supports all leading Hadoop distributions, such as Cloudera, Hortonworks and MapR.
  • Autonomy IDOL – Seamless access to 100 percent of enterprise data, whether it is human or machine generated.
  • Vertica – A massively scalable, high performance column based database platform, custom-built for real-time analytics on petabyte-sized datasets. Vertica supports standard SQL- and R-based analytics, and offers support for leading BI and ETL tools and Apache Spark.
  • Enterprise Security ArcSight – Provides real-time collection and analysis of logs and security events from a wide range of devices and data sources.
  • n Apps – HPE Enterprise Services data scientists and technology specialists support the building of business intelligence applications tailored to the agencies’ requirements.

In addition, the following technologies integrate seamlessly with the HPE HAVEN Platform:

  • Extract-Transform-Load (ETL) Tools – Structured data ETL tools include Informatica PowerCenter, Oracle Data Integrator, Microsoft SQL Server Integration Services (SSIS) and IBM InfoSphere DataStage. For unstructured data, the Apache Shark, Apache Hive and Apache Pig frameworks are the forerunners. Both Pig and Hive leverage the Map-Reduce method to access the Hadoop Distributed File System (HDFS); in addition, Apache Hive offers an SQL-like syntax in which scripts are written for manipulating unstructured data, and supports JDBC connections to Hadoop. Apache Shark provides capabilities similar to Hive but leverages Apache Spark instead of Map-Reduce to access HDFS.

  • Basic Data Visualization Tools – QlikView and Tableau are two enterprise-grade data visualization tools that enable data to be presented visually without the need for developers to write complex reporting programs.
  • Data Link Analysis Tools – VisualLinks and Centrifuge are data visualization tools that present data in terms of networks and relationships rather than charts or statistical reports. They provide users with the ability to follow relationships across nodes to discover abnormal or unusual relationship patterns.
  • Advanced Analytics & Reporting Tools – The open source R programming language, IBM SPSS and the SAS data analytics engine are the forerunners in the advanced analytics space.
  • API Gateways – A popular way of sharing structured data between government agencies online is via RESTful web service calls. RESTful web services use standard HTTP formats for data requests and responses. API Gateways such as IBM DataPower, Apigee and Layer7 are popular appliances that manage and secure the web services operated by government agencies.

Know The Benefits

The HPE HAVEN Analytics Platform and Reference Architecture bring the following important benefits to government agencies:

  • Open Platform – The platform uses open standards and best-of-breed analytic solutions. This approach avoids vendor lock-in and ensures the platform remains supportable in the future.
  • Scalability & Extensibility – As the data volume, the number of users and the complexity of the problem domain grow, the implementation described above can scale to match.
  • Speed – The platform is memory- and data-efficient, so agencies can rapidly analyze very large volumes of structured and unstructured data.
  • Interactivity – A multi-user, interactive analytics environment supports a broad spectrum of data visualization techniques that improve users’ productivity.

Conclusion

Data is the most important asset for government agencies to understand. Data analysis done right extracts value immediately: it makes clients happier, informs big decisions quickly and drives continuous improvement. It’s time to take a look.

White paper written by Yik-Joon Ho, Chief Technologist, Hewlett Packard Enterprise Singapore

Learn more at hpe.com/us/en/solutions/big-data.html