Integration of heterogeneous multidimensional data marts
The integration process is often undertaken by expert database practitioners who will need to analyze the structure of the data, and match schemas and data before creating an integrated view of the data for visualization and analysis.
Such a manual process may be acceptable for databases used in transaction processing applications, but it does not help decision makers who need quick, cost-effective access to information in a constantly changing environment.
This thesis addresses several challenges in automating the integration of data warehouses based on a dimensional model known as the Star schema.
We recognize that the structure of multidimensional data, namely dimension hierarchies, is critical to the accuracy of the integration but is not always available or accessible.
To address this problem, we infer dimension hierarchies from their instances, and demonstrate that they are sufficient to ensure the accuracy of the integration even though they may vary from the intended hierarchies.
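As an illustration of inferring hierarchies from instances, the following minimal sketch treats a level as rolling up to another when every value of the first maps to exactly one value of the second (a functional dependency observed in the data). This is an assumed, simplified criterion, not necessarily the method used in the thesis; the function names and sample rows are hypothetical.

```python
from itertools import permutations

def rolls_up(rows, child, parent):
    """Return True if every child value maps to exactly one parent value."""
    seen = {}
    for row in rows:
        c, p = row[child], row[parent]
        # setdefault records the first parent seen for c and returns it afterwards
        if seen.setdefault(c, p) != p:
            return False
    return True

def infer_rollups(rows, levels):
    """List every (child, parent) pair of levels whose instances form a roll-up."""
    return [(c, p) for c, p in permutations(levels, 2) if rolls_up(rows, c, p)]

# Hypothetical instances of a location dimension
rows = [
    {"city": "Melbourne", "state": "VIC", "country": "AU"},
    {"city": "Geelong", "state": "VIC", "country": "AU"},
    {"city": "Sydney", "state": "NSW", "country": "AU"},
    {"city": "Auckland", "state": "AUK", "country": "NZ"},
]

rollups = infer_rollups(rows, ["city", "state", "country"])
print(rollups)
```

Because the test relies only on the instances at hand, a small or skewed sample can yield roll-ups the designer never intended, which is one way an inferred hierarchy may differ from the intended one.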
To improve the accuracy of matching Star schemas, we propose a more precise representation of Star schemas and demonstrate its effectiveness by comparing it against the existing approaches that treat Star schemas as relational models.
To match instances of dimensions, we demonstrate that a graph matching algorithm is effective and performs with a high level of accuracy.
We propose algorithms that enforce the tree structure of the integrated data, which is necessary for correct aggregation, and that reduce the false positives produced during instance matching.
The effectiveness of our algorithms is shown through experiments with real-life data.
Even when schemas and hierarchies match perfectly, dimensions often contain mismatching data, which restricts the scope of the integration.
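To make the tree-structure requirement concrete, the sketch below flags members of a merged dimension that end up with more than one parent, since such members would be counted more than once when measures are aggregated upward. It only detects violations and is an assumed illustration, not the thesis's algorithm; the edge list is hypothetical.

```python
from collections import defaultdict

def multi_parent_members(edges):
    """Map each child with more than one distinct parent to its sorted parents."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return {c: sorted(ps) for c, ps in parents.items() if len(ps) > 1}

# Hypothetical child -> parent edges after merging two sources that
# spell the same state differently
edges = [
    ("Melbourne", "VIC"),
    ("Geelong", "VIC"),
    ("Melbourne", "Victoria"),
    ("Sydney", "NSW"),
]

conflicts = multi_parent_members(edges)
print(conflicts)
```

Each flagged member points to a parent pair that instance matching either failed to unify or unified incorrectly, so the output can serve as a worklist for repair.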
We propose to relax the requirement for dimension compatibility, and introduce measures that quantify the loss of data resulting from the less strict requirement.
These measures enable data analysts to identify lossless fragments of data and thereby extend the scope of the integrated data.
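One simple way to picture such a measure is the share of a source's measure total that survives when only members common to both sources are kept. The definition below is a hypothetical stand-in chosen for illustration; the thesis's actual measures are not specified in this abstract, and the sample facts are invented.

```python
def retained_fraction(facts, shared_members):
    """Fraction of the measure total attached to members present in both sources.

    facts: iterable of (member, measure_value) pairs from one source.
    shared_members: set of dimension members common to both sources.
    """
    facts = list(facts)
    total = sum(value for _, value in facts)
    kept = sum(value for member, value in facts if member in shared_members)
    # An empty source loses nothing, so report full retention
    return kept / total if total else 1.0

# Hypothetical facts from source A and dimension members from source B
facts_a = [("Melbourne", 120.0), ("Sydney", 80.0), ("Perth", 50.0)]
members_b = {"Melbourne", "Sydney", "Brisbane"}

shared = {member for member, _ in facts_a} & members_b
fraction = retained_fraction(facts_a, shared)
print(f"retained: {fraction:.2f}, lost: {1 - fraction:.2f}")
```

Under this assumed definition, a fragment with a retained fraction of 1.0 would be lossless, which is the kind of fragment an analyst could integrate without giving up data.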
To provide a more comprehensive view of data for analysis, we link the integrated data with the data exclusive to each source by extending the navigation operation for multidimensional data.
These contributions help shift the integration task from expert database practitioners to data analysts, empowering them to combine multidimensional data from multiple sources in real time and in a cost-effective manner.
History
Degree Type
Doctorate by Research
Imprint Date
2012-01-01
School name
School of Science, RMIT University
Former Identifier
9921861137201341
Open access
Yes