Monday, March 16, 2009

Data Profiling - Basic Measures

Last week I talked about how data profiling is an important first step in getting your data quality program off the ground. Data profiling is just another term used to describe the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data.

There are a lot of robust data profiling tools available but you should be able to begin performing analysis using a common tool such as excel or access.

Here are the basic measures that we started with:

Company (or Account) Name
· Contains the word(s) “duplicate” or “out of business”
· Contains invalid characters

Company (or Account) Address
· Is blank
· Contains invalid characters
· Contains the words Address Not Known” or “unknown”

· Contains invalid characters and or 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
· Contains the word “various” or “unknown” or 'city Not Known'

Company (or Account) Annual Sales (or income)
· Is Blank
· Contains a numeric value less than 100$

Primary Contact First Name
· First Name is blank
· Contains the word(s) ‘Unknown’ or ‘Not Known’
· Contains the words Address Not Known” or “unknown”

Primary Contact Second Name
· Second Name is blank
· Contains the word(s) ‘Unknown’ or ‘Not Known’
· Contains the words Address Not Known” or “unknown”

Primary Contact Phone
· Is Blank

It’s important to keep in mind that you should be measuring against the business rules. For example, if a phone number is mandatory then you should not need to check if there are any blanks.

It is also important to understand the purpose for the data. For example, if the purpose of collecting the Company (or Account) address is to mail documentation then you should probably be checking to see if some of your address data contains Physical address information (such as 123 Main Street) vs. Mailing address information (such as P.O. Box 123).

The Initial Results

It doesn’t matter if you are a data geek or not, the results are always very interesting. Some of the statistics we identified as a result of our initial analysis were as follows:
47% of our address information was blank
10% of our address information was ‘Not Known’ or ‘unknown’
8% of our Company Names had ‘Duplicate’ in the name

What do your results tell you?

Your results tell you where you need to focus efforts to improve or even remove the data. If you are finding a lot of blanks (or null values) for your data then your company probably is not using the information collected and you may want to stop collecting it.

The Purple Cow (aka ‘how to stand out from the rest of the crowd’)

Since we didn’t have a sponsor and we had no idea who would be interested in the results we made a splashy ‘Did you know’ flyer that listed the results and posted them in the washrooms – on the doors of the stalls. Needless to say we got people’s attention...

Next Week
Next week I’ll talk about the types of resources we have on the team and how one of the most important attributes your team members can have is their ability to market the message.


  1. Really cool to see some real-world metrics. Thanks for sharing them. I'm curious though, would there be any positive to aggregate the metrics into business issues? So for example, if you take all the address elements and figure out how many (%) contacts would fail if you tried to mail them. Since there is a dollar figure associated with mailing, it might make it easier to attract a sponsor's attention. Tracking that over time would just show how valuable you are.

  2. Thanks for the comments Steve, I'm flattered to say the least! Regarding the metrics, we have progressed to the state that we are measuring returned mail. We even use Trillium software (Discovery and Quality) to profile and apply data cleansing to the address (among many other things). That being said though we still have challenges engaging sponsors. It's not that they don't care, it's just that they have many other priorities.
    In terms of maturity, I'd say we're within the 2nd level. It's getting past the next level that will be the real challenge. I'm sure you'll be interested in my upcoming post on data governance...that's where all the fun is right?