Monday, March 16, 2009

Data Profiling - Basic Measures

Last week I talked about how data profiling is an important first step in getting your data quality program off the ground. Data profiling is just another term used to describe the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data.

There are a lot of robust data profiling tools available but you should be able to begin performing analysis using a common tool such as excel or access.


Here are the basic measures that we started with:

Company (or Account) Name
· Contains the word(s) “duplicate” or “out of business”
· Contains invalid characters

Company (or Account) Address
· Is blank
· Contains invalid characters
· Contains the words Address Not Known” or “unknown”

City
· Contains invalid characters and or 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
· Contains the word “various” or “unknown” or 'city Not Known'

Company (or Account) Annual Sales (or income)
· Is Blank
· Contains a numeric value less than 100$

Primary Contact First Name
· First Name is blank
· Contains the word(s) ‘Unknown’ or ‘Not Known’
· Contains the words Address Not Known” or “unknown”

Primary Contact Second Name
· Second Name is blank
· Contains the word(s) ‘Unknown’ or ‘Not Known’
· Contains the words Address Not Known” or “unknown”

Primary Contact Phone
· Is Blank

It’s important to keep in mind that you should be measuring against the business rules. For example, if a phone number is mandatory then you should not need to check if there are any blanks.

It is also important to understand the purpose for the data. For example, if the purpose of collecting the Company (or Account) address is to mail documentation then you should probably be checking to see if some of your address data contains Physical address information (such as 123 Main Street) vs. Mailing address information (such as P.O. Box 123).


The Initial Results

It doesn’t matter if you are a data geek or not, the results are always very interesting. Some of the statistics we identified as a result of our initial analysis were as follows:
47% of our address information was blank
10% of our address information was ‘Not Known’ or ‘unknown’
8% of our Company Names had ‘Duplicate’ in the name


What do your results tell you?

Your results tell you where you need to focus efforts to improve or even remove the data. If you are finding a lot of blanks (or null values) for your data then your company probably is not using the information collected and you may want to stop collecting it.


The Purple Cow (aka ‘how to stand out from the rest of the crowd’)

Since we didn’t have a sponsor and we had no idea who would be interested in the results we made a splashy ‘Did you know’ flyer that listed the results and posted them in the washrooms – on the doors of the stalls. Needless to say we got people’s attention...


Next Week
Next week I’ll talk about the types of resources we have on the team and how one of the most important attributes your team members can have is their ability to market the message.

Monday, March 9, 2009

From the Ground Up

I've been thinking about writing about Data Quality for a while now. I'm responsible for a Data Quality Program at a Crown Corporation and since the program started just 2 1/2 years ago a lot of progress has been made. What's the big deal you say? Well the program has no executive business sponsor (bad!), our team has been floating around different groups trying to find a home (bad!), there is no formal mandate (bad!) and yet despite all those no-No's we have made a lot of great progress! We are a team of enthusiastic creative thinkers who have broken the mold and achieved success due to 2 key strategies:
1/ Our program is based on Industry Best Practices
2/ Our methods for communicating and engaging others are very 'Purple Cow' - "In his book Purple Cow: Transform Your Business by Being Remarkable, Seth Godin says that the key to success is to find a way to stand out--to be the purple cow in a field of monochrome Holsteins.

The title: 'Data Quality - From the Ground Up', is just that. Implementing a successful Data Quality Program without an executive sponsor or a mandate starting from the ground up CAN work and the goal of writing this is to share these strategies in the hopes that what has worked for us will help others achieve the same success.

Getting Started - The Basics
This week I'll start with the basics; those industry best practices that are logical and do-able.
1/ Identify the important data
For us it was Customer type data and we started with the basics; name, address, city, province/state, country, phone, fax, email, website.
2/ Profile the important data
Data profiling is just another word for data analysis. Get an extract of the data and start with the most basic analysis. Is it complete? How much of it is blank?
3/ Find someone who cares
For us it meant someone in IT, as IT has known about data quality issues for a long time. Better, would be someone on the business side, but take what you can get.
4/ Communicate your results
Find a way, any way to communicate what you've found. Post the results on your intranet, send them in an email (interesting profiling results tend to get forwarded), or post them by the printers or water cooler. More information coming later on some of the 'purple cow' methods that worked for us.
5/ Define the data
Start gathering and documenting the basic definitions and business rules for the important data and share it. You would not believe how people will thank you.

This was our approach and we only used some basic Microsoft tools like excel for the profiling and word for the definitions. Today we have an enterprise data management tool for the profiling and a wiki with over 900 corporate definitions. Not bad for 2 1/2 years.

There are a lot of other things that can and need to be done as well, but the methods described above is a good start and you can't go wrong.

In the coming blogs I'll talk about the results of each of the above and explain how we logically progressed to where we are today. I'll also share some fun 'purple cow' stuff that we've done with amazing results.

Next Week: What did we find when we profiled our important data and what did we do about it?


Thoughts to ponder: Be a yardstick of quality. Some people aren't used to an environment where excellence is expected.
~Steve Jobs~