zondag 25 april 2010

Datavault: Identifying the Links

In the first parts (part 1 and part 2) I wrote a small introduction about Datavault and identified the Hubs during a small analysis. Today I'm going to identify the Links in the 3NF model of the AdventureWorksLT database.

So what are links, again? Links tie two or more hubs together. Links represent the relationship between two or more business elements. These are usually the business events or transactions. Each link has its own surrogate key, the keys from the hubs, a load date time stamp and a record source.
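As an illustration, a link between two hypothetical hubs could be sketched like this in T-SQL (the table and column names are my own invention, not an official standard):

```sql
-- Hypothetical link between a Customer hub and a Product hub
CREATE TABLE LNK_CustomerProduct (
    CustomerProductSQN INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key of the link itself
    CustomerSQN        INT NOT NULL,                  -- surrogate key from HUB_Customer
    ProductSQN         INT NOT NULL,                  -- surrogate key from HUB_Product
    LoadDTS            DATETIME NOT NULL,             -- load date time stamp
    RecordSource       VARCHAR(50) NOT NULL           -- record source
);
```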

Below you can see the diagram in which i've identified the link tables:

How can you identify the link tables? Link tables can be identified by looking at typical transaction tables and many-to-many tables (facts). Secondly, Linstedt states that any table that doesn't have a respective business key becomes a good link candidate. First, let's start with identifying the easy link tables (many-to-many):

  • Salesorderheader: Many-to-many table.
  • Salesorderdetail: Many-to-many table.
  • CustomerAddress: Many-to-many table.
  • ProductModelProductDescription: Many-to-many table.
  • Product: Many-to-many table.
So there's one more link table left:
  • The recursive Product Category table. This table also has a parent-child relationship and therefore a link table is necessary.
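Such a recursive (parent-child) relationship can be sketched as a link that references the same hub twice. Again, the naming here is my own and just an illustration:

```sql
-- Hypothetical hierarchical link replacing the recursive FK on ProductCategory:
-- both key columns point to the same hub, in two different roles
CREATE TABLE LNK_ProductCategoryHierarchy (
    ProductCategoryHierarchySQN INT IDENTITY(1,1) PRIMARY KEY,
    ProductCategorySQN          INT NOT NULL, -- child category (HUB_ProductCategory)
    ParentProductCategorySQN    INT NOT NULL, -- parent category (same hub, second role)
    LoadDTS                     DATETIME NOT NULL,
    RecordSource                VARCHAR(50) NOT NULL
);
```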
So that's it for now. Next time the satellites.


woensdag 21 april 2010

SQL Server 2008 R2 Launch, Neuss, Germany

The next seminar on the schedule this year was the SQL Server 2008 R2 Launch in Neuss, Germany. A couple of hours driving from Utrecht, but it was worth it. Key speaker was Donald Farmer, Principal Program Manager for Microsoft SQL Server.

He spoke almost the whole day (at least it appeared so to me) and he had the following sessions:
* SQL Server 2008 R2 and the evolution of Business Intelligence.
* Making Decisions - How knowledge workers transform their role with Business Intelligence.
* Supporting Decisions - How IT Pros transform the game with Business Intelligence.

Well, what was my impression of this day? A couple of things: at the Informatica seminar I didn't get a good understanding of the role of the datawarehouse in the future. Today with Donald I got the feeling that there are different kinds of BI solutions, and it's called managed BI. This means that the power user will get more tools (PowerPivot, Report Builder) and functionality to do things and develop (a sort of prototyping).

When things are getting too strategic in this development, the datawarehouse (or BI) comes in. The solution becomes integrated into the mainstream of the organization (i.e. the datawarehouse). So that will be the role of BI: not developing or exploring BI solutions, but aiding the power user and integrating this in a standardized manner.

When will it be integrated into the standardized BI solution platform? Well, there are three things playing a role here: it needs to be standardized because of the critical impact of the solution on the organization, because it needs to be scalable, and because of the skills needed for further development.

Whereas Informatica (see previous blog) spoke yesterday about real time, I didn't hear anything today about real-time processing. That's curious to me.


dinsdag 20 april 2010

Informatica World tour 9 in Houten

Today I went to the Informatica World Tour 9 seminar in Houten, Netherlands. Why? Well, I wanted to know more about the strategy of Informatica. Overall you can say that the strategy of Informatica is focusing on the same points as Microsoft does. They both talk about Master Data Management (MDM), they both talk about Complex Event Processing (CEP), real time and self-service BI.

I'm reading quite some stuff in other blogs about the "more versions of the truth" (not single) and real-time processing (not datawarehousing). So what is this about? Why are vendors/analysts (and Informatica) saying this? Are there developments which support this movement? I think there are a couple of developments on the market:
* Hardware is getting increasingly faster and so are networks.
* The need for agility of organisations.
Why is one version of the truth debated? Should we have multiple versions of the truth? Should we go back to the multiple logic in multiple reports?

All of the above has to do with the fact that the datawarehouse, as we know it today, will not be the one version of the truth, but could/would be one source for the reports. Informatica is setting the datawarehouse aside and says that it's a source like any other. Perhaps a bit cleansed, uniformed, etc., but a source like any other. Informatica puts their tools in the middle (who doesn't?).

An IDC speaker (Martin Canning) said multiple times that real-time processing is coming in the next years. Informatica processes requests from the reports/analytic tools and decides from which source to gather the information (DWH, source, whatever). Not sure whether it's all real time.

So I'm wondering: what will be the role of the datawarehouse in the future? Should consolidated information be stored there? Or history for auditing and trend analysis? Informatica didn't talk about this. System of record? I didn't hear about it. Old data? It's ballast and you don't need it, said one of the speakers.

What to put in a datawarehouse and what not? Who is going to decide this?

As my post above shows, there are a lot of questions to be answered in the coming time. An interesting era (the information era) has started.

So, tomorrow I go to Neuss, Germany for the launch of SQL Server 2008 R2. Great! We'll see what they will talk about.


zondag 18 april 2010

FailPackageOnFailure and Checkpoints

A couple of weeks ago I was experimenting with setting FailPackageOnFailure to true and false in combination with checkpoints. I was trying to find out whether I have to set it to true or false and what the effect was with checkpoints.

You'll probably know what checkpoints are, but a little explanation for those who haven't used checkpoints before. SSIS can restart failed packages from the point of failure, instead of rerunning the whole package. If a package is configured to use checkpoints, information about the package execution is written to a checkpoint file. When the failed package is rerun, the checkpoint file is used to restart the package from the point of failure. If the package runs successfully, the checkpoint file is deleted, and then re-created the next time that the package is run (MSDN).
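For reference, enabling checkpoints takes three package-level properties plus one property on each task or container that should act as a restart point (the file path below is just an example):

```
Package level:
  CheckpointFileName = C:\Checkpoints\MyPackage.xml
  CheckpointUsage    = IfExists
  SaveCheckpoints    = True

Per task/container that should be a restart point:
  FailPackageOnFailure = True
```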

Below you can see a checkpoint with FailPackageOnFailure = True. On the left an initial run and on the right the rerun.

When FailPackageOnFailure = False you'll see the following happening (on the left the initial run and on the right the rerun):

So it's very important to set FailPackageOnFailure to true.

So what does the checkpoint file look like?

<DTS:Checkpoint xmlns:DTS="www.microsoft.com/SqlServer/Dts" DTS:PackageID="{24981BF3-705B-4AE8-9371-802509214EE1}">
  <DTS:Variables DTS:ContID="{24981BF3-705B-4AE8-9371-802509214EE1}" />
  <DTS:Container DTS:ContID="{8C9FE8D0-F622-411B-8C7C-8ADBB9F4CFC5}" DTS:Result="0" DTS:PrecedenceMap="" />
</DTS:Checkpoint>

So this file is only saying: the package stopped at this location. Nothing more. Deleting the file will reset the package, and the whole thing will run when you execute the package.
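If you want to force a full rerun, removing the file by hand is enough. A trivial sketch (the path and file name are just examples, not from a real package):

```shell
# Force a full rerun by deleting the checkpoint file (path is just an example)
CHK=/tmp/MyPackage_checkpoint.xml
touch "$CHK"      # pretend a failed run left a checkpoint behind
if [ -f "$CHK" ]; then
  rm "$CHK"       # the next execution of the package starts from the beginning
fi
```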


woensdag 14 april 2010

Datavault: modeling the hubs

This post will be about the first step of modelling a 3NF model into a datavault model: modeling the hubs. For this purpose I will be using the AdventureWorksLT database from SQL Server 2008 R2. You can download this from Microsoft.com. This database will be transformed into a datavault database (AdventureWorksLTDV). What does this AdventureWorksLT database look like?

For a detailed diagram of the AdventureWorksLT datamodel see below (you can download the Visio from Microsoft.com):

This diagram consists of 12 tables (10 connected and 2 tables not connected). This model is focused on the sales orders, customers and products. These are the main subjects of this model. What are the steps to be taken for turning this 3NF model into a datavault model? Well, Linstedt defines 4 steps:

1. Model the hubs. This requires an understanding of the business keys and their usage across the designated scope.
2. Model the links. Forming the relationships between the keys, formulating an understanding of how the business operates today in the context of the business key.
3. Model the satellites. Providing context to each of the business keys and the transactions (links) that connect the hubs together.
4. Reorganize the satellites into separate tables for different rates of change.

What are the rules of datavault?
1. Hub keys cannot migrate into other hubs (no parent/child like hubs).
2. Hubs must be connected through links.
3. More than two hubs can be connected through links.
4. Links can be connected to other links.
5. Links must have at least two hubs associated with them in order to be instantiated.
6. Surrogate keys may be utilized for hubs and links.
7. Surrogate keys may not be utilized for satellites.
8. Hub keys always migrate outwards.
9. Hub business keys never change; hub primary keys never change.
10. Satellites may be connected to hubs or links
11. Satellites always contain either a loaddate timestamp or a numeric reference to a standalone loaddate timestamp sequence table.
12. Stand alone tables such as calendars, time, code and description tables may be utilized.
13. Links may have a surrogate key.
14. If a hub has two or more satellites, a point in time table may be constructed for ease of joins.
15. Satellites are always delta driven, duplicate rows never appear.
16. Data is separated into satellite structures based on 1) type of information 2) rate of change.

Ok, well, according to the steps defined by datavault, we should start first with the hubs. Let’s walk through the tables. Datavault makes a distinction between business keys and surrogate keys. I haven't found a good definition for this. A business key seems to me an identification of an object which the business speaks about, and the surrogate key is an id that is more technical. When I have clarification about this I'll blog about it.
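To make the distinction a bit more concrete, a hub for Customer could look like this. The column naming convention is my own, not an official Linstedt standard:

```sql
-- Hypothetical hub for Customer: the business key is what the business talks about,
-- the surrogate key (sequence) is purely technical
CREATE TABLE HUB_Customer (
    CustomerSQN  INT IDENTITY(1,1) PRIMARY KEY, -- technical surrogate key
    CustomerBK   NVARCHAR(100) NOT NULL UNIQUE, -- business key (e.g. Lastname)
    LoadDTS      DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL
);
```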

The tables:
  • Salesorderheader: This table appears to have a surrogate key. But no further investigation is needed, because I know this should be a link table (a typical fact table with multiple foreign key relations). So no hub for this one.
  • Salesorderdetail: This is a typical non-hub table, because it is more an aiding table due to normalization than a table that stands on its own.
  • Customer: Lastname is probably the best choice as the business key. CustomerID is the surrogate key.
  • CustomerAddress: No business key is identifiable. It’s more a normalization table.
  • Address: No business key is identifiable. It’s more a normalization table.
  • Product: Name is the business key, ProductID is the surrogate key. Therefore it’s a hub table.
  • Product Model: Name is the business key and ProductModelID is the surrogate key. Therefore it’s a hub table.
  • ProductModelProductDescription: No business key available.
  • ProductDescription: Description will be the business key and ProductDescriptionID is the surrogate key. Therefore it’s a hub table.
  • Product Category: Name is the business key and ProductCategoryID is the surrogate key. Therefore a hub table.
BuildVersion and ErrorLog are different tables. I'll blog about these later.
Well, we have identified the first set of hub tables. Maybe something changes later on, but I’m pretty confident that I have identified the right hub tables. In a diagram it looks something like this:

So next time: the links....


zaterdag 10 april 2010

Datavault: an introduction

This year one of my goals is mastering Datavault modeling. In June I will be getting certified in datavault. Dan Linstedt is coming to the Netherlands and will be lecturing on datavault modeling. On the last day an exam will be held. So in the coming weeks I’ll be blogging about the different aspects of datavault.

So what’s datavault all about? Well, it’s an alternative way of modeling. Other (data) modeling techniques are Boyce-Codd normal form and star schema modelling (Kimball). Datavault is a hybrid approach encompassing the best of breed between 3rd normal form and star schema. As Linstedt says: the design is flexible, scalable, consistent and adaptable to the needs of the enterprise. The model meets the needs of an Enterprise datawarehouse.

Below you can see an example of a complex datavault model, presented in a high level format:
This diagram shows a highly complex datavault datamodel. Looking at this diagram will give you the impression of a neural network like the one in our brains. And yes, it's influenced by that principle.
Mainly, there are three types of tables in the datavault design:
  • Satellites (green): Tables with attributes which change at different rates (e.g. customer name). Satellites provide the context for the business processes that are captured in hubs and links. This is mostly descriptive and historical data, and it’s used as descriptive information for the hubs or the links. You could compare this with a Type II dimension in the Kimball methodology. Its function is to provide context around the hub key.
  • Hubs (purple): These are the stand-alone tables and they contain a list of the unique business keys. These keys identify the information domains used by the business. If the business were to lose the key, they would lose the reference to the context, or surrounding information. Examples are customers (CustomerNumber), employees (Social security number), products (part_number), sales orders (OrderID), purchase orders (PurchaseCode), invoices (InvoiceNumber), etc.
  • Links (black): These are the associations or transactions between the business keys, and they are used for relationships between hubs and links. Links are the physical representation of a many-to-many 3NF relationship. The link represents the relationship between two or more business components (two or more business keys). This is an adaptation of a many-to-many relationship in 3NF in order to resolve problems related to scalability and flexibility.
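A minimal sketch of a satellite on a customer hub (again, my own naming, just an illustration) shows why it resembles a Type II dimension: the load date is part of the primary key, so every change becomes a new row:

```sql
-- Hypothetical satellite on a customer hub: descriptive attributes, historized by LoadDTS
CREATE TABLE SAT_Customer (
    CustomerSQN  INT NOT NULL,      -- key of the parent hub (no surrogate key of its own)
    LoadDTS      DATETIME NOT NULL, -- part of the primary key: each delta is a new row
    FirstName    NVARCHAR(50),
    Phone        NVARCHAR(25),
    RecordSource VARCHAR(50) NOT NULL,
    PRIMARY KEY (CustomerSQN, LoadDTS)
);
```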
So this is it for now. The next blogs will be about how to turn a 3NF datamodel into a datavault model.


woensdag 7 april 2010

Statistics revisited

Currently, I am working on a short project. In SQL Server 2000 there is an intensively used table with a lot of columns, and a lot of indexes have been created on this table (about 80). On top of that, SQL Server 2000 has used all the remaining index 'slots' for statistics. In 2005 and 2008 indexes and statistics are separated, but in 2000 they aren't.

I've been searching for this for quite a long time and I found a confirmation of this on Kimberly Tripp's site (http://www.sqlskills.com/blogs/Kimberly/post/Indexes-in-SQL-Server-20052008-Best-Practices-Part-1.aspx#ixzz0kP811y1Z). The vendor wants to add indexes on this table but can't, because the index slots are being used by the statistics.

So this means:
Clustered index:                  1
Non-clustered indexes:           80
Statistics used by the indexes:  81
Total:                          162

And there is a maximum of 250 indexes/statistics in SQL Server 2000, so this means there are 250 - 162 = 88 statistics being used, and we have to determine whether they should be dropped. How? I haven't found out yet. When I find the answer I'll come back to blog about it.
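In the meantime, the auto-created statistics can at least be listed from sysindexes (in SQL Server 2000 the _WA_Sys_ name prefix marks auto-created ones). Something along these lines should show the candidates; the table name is a placeholder, and whether any of them are safe to drop is exactly the open question:

```sql
-- List statistics-only entries on a table in SQL Server 2000
-- ('MyBigTable' is a placeholder for the real table name)
SELECT name
FROM   sysindexes
WHERE  id = OBJECT_ID('MyBigTable')
  AND  INDEXPROPERTY(id, name, 'IsStatistics') = 1

-- A single statistic could then be removed with, for example:
-- DROP STATISTICS MyBigTable._WA_Sys_SomeColumn_0425EA2B
```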