Tuesday, September 16, 2008

UNDERSTANDING ENTITIES - CHAPTER 2

What is an Entity ?

An entity is s physical representation of a logical grouping of data. Entities can be tangible, real things, such as a PERSON or ICE CREAM, or intangible concepts, such as a COST CENTER or MARKET. Entities do not represent single things. Instead, they represent collections of instances that contain the information of interest for all instances or occurrences. For example a PERSON entity represents instances of things of type Person. Gabriel De Angelies, R.J golcher, Jessica Corter, and Venessa Westley are examples of specific instances of PERSON. A specific instance of an entity is represented by a row and is identified by a primary key.

An entity has the following characterstics:
• It has a name and description.
• It represents a class, rather than a single instance of a concept.
• It has the ability to uniquely identify each specific instance.
• It contains a logical grouping of attributes representing the information of
interest to the enterprise.


Formal Entity Definitions

The following list contains entity definitions from some of the most influential leaders in data modeling. Notice the similarities:
• Chen (1976): “A thing which can be distinctly identified.”
• Date (1986): “Any distinguishable object that is to be represented in the database.”
• Finklestein (1989): “A data entity represents some ‘thing’that is to be stored for later reference. The term entity to the logical representaion of data.”

Defining Entity Types

Within the independent and dependent entities are entity types:
• Core entities-- These are sometimes called primary or prime entities. They represent the important objects about which the enterprise in interested in keeping data.
• Code/reference/classification entities-These entities contain rows that define the set of values, or domain, for an attribute.
• Associative entities--These entities are used to resolve many-to-many relationships.
• Subtype entities-These entities come in two types, exclusive and inclusive.

Core Entity

Core entities are the most important objects about which an enterprise is interested in keeping data. They are often referred to as prime, principal, or primary entities. Because these entities are so important, it is likely that they are used elsewhere in the enterprise. Take the time to look for similar entities because there are many opportunities for the reuse of core entities. Core entities should be modeled consistently throughout the enterprise. Good modelers consider this an essential best practice.


Note the straight corners of the independent entities, STORE and ICE CREAM and the rounded corners of the dependent entity STORE ICE CREAM.

A core entity can be an independent entity or a depenent entity. Figure 2.1 provides examples of core entities for an enterprie that sells ice cream. ICE CREAM represents the base products sold by the enterprise. STORE is an example of a distribution channel, or the vehicle through which a product is sold.

Consider that the enterprise is doing well and has decided to add another STORE. The model requires no change to support the addittion of a new instance of STORE. It is simpy another row added to the STORE entity. The same applies to ICE CREAM.

Notice the core entities ICE CREAM and STORE. Although the example may seem straight forward, it illustrates a powerful concept regarding the modeling of core entities.

Understanding how to model core entities as scalable and extensible containers of information requires the modeler to think about the entity as an abstract concept and to model the information independently of the way it is used today. In this example, model ICE CREAM completely outside the context of STORE and vice versa. So, if the enterprise decides to sell ICE CREAM using an addittional channel, such as the Internet or door-to-door, the new channel can be added without distrubing other entities.

Code Entity

Code entities are always independent entities. They are often referred to as reference, classification, or type entities, depending on the methodology. The unique instances represented by code entities define the domain of values for attributes present in other entities. You might be tempted to use a single attribute in a code table. It is a best practice to include at least three attributes in a code entity: an identifier, a name (sometimes called a short name), and a description.

In Figure 2.2, TOPPING is an independent entity; note the sharp corners. TOPPING is also a code or classification entity. The instances (or rows) of TOPPING define the list of toppings available.





Figure 2.2
Code entities allow an enterprise to define a set of values for consistent use throughoput the enterprise. The instances of a code entity define a domain of values for use elsewhere in the model.

Code entities usually contain a limited number of attributes. I have seen instances where these entities contain only a single attribute. I prefer to model code entities with an artificial identifier. Using an artificial identifier, along with a name and description, allows the addittion of new kinds of TOPPING to be added as instances (rows) in the entity. Note that TOPPING contains three attributes.

I often refer to code entities as corporate business objects. The name, corporate business objects, indicates that the entities are defined and shared at a corporate level, not by a single application, system, or business unit. These entities are often shared by many databases to allow consistent roll-up reporting or trending analysis.

Associative entity

Associative entities are entities that contain the primary key from two or more other entities. Associative entities are always dependent entities. They are used to resolve many-to many relationships between other entities. Many – to many relationships are those in which many instances of one entity are related to many instances of another. Associative entities allow us to model the intersection between the instances of the two entities, thereby allowing each instance in the associatie entity to be unique.

Note

Many- to – many relationships cannot be implemented in a physical database. ERwin will automatically create an associative entity to resolve a many-tomany relationship when the model is changed from logical to physical mode.

Figure 2.1 uses an associative entity to resolve a many-to-many relationship between STORE and ICE CREAM. The addittion of an associative entity allows the same ICE CREAM to be sold in many instances of STORE, while not requiring every STORE to sell the same ICE CREAM. The associative entity STORE ICE CREAM resolves the fact that an instance of STORE sells many instances of ICE CREAM and an instance of ICE CREAM is sold by many instances of STORE.


Subtype Entity

Subtype entities are always dependent entities. You should use subtype entities when it makes sense to keep different sets of attributes for the instances of an entity. Finklestein refers to subtype entities as secondary entities. Subtype entities almost always have one or more “sibling” entities. The subtype entity siblings are related to a parent entity through a special relationship that is either exclusive or inclusive.

Note

Subtype sibling entities that have an exclusive relationship to the parent entity indicate that only one sibling has an instance for each instance of the parent entity. Exclusive subtypes represent an “is a” relationship.
Subtype sibling entities that have an inclusive relationship to the parent entity indicate that more than one sibling can have an instance for each instance of the parent entity.

Figure 2.3 shows the CONTAINER entity and the subtype entities CONE and CUP. The ice cream store apparently does not sell ice cream in bulk, only single servings. Note that an instance of CONTAINER must be either a CONE or a CUP. A CONTAINER cannot be both a CONE and a CUP. This is an exclusive subtype.

Figure 2.3, the PERSON entity has two subtypes, EMPLOYEE and CUSTOMER. Note that an exclusive subtype would not allow a single instance of PERSON to contain facts common to both an EMPLOYEE and a CUSTOMER. A VENDOR can also be a CUSTOMER.These are examples of inclusive subtypes.






Figure2.3
Two examples of subtype entities, PERSON and CONTAINER. Both use ERwin IE notation to represent exclusive and inclusive subtypes. The (X) in the subtype symbol of CONTAINER indicates exlusive. The absence of the (X) in the subtype symbol indicates inclusive.


Structure Entity

Sometimes, instances of the same entity are related. In his 1992 book Strategic Systems Development, Clive Finklestein proposes the use of a structure entity to represent relationships between instances of an entity. Relationships between instances of an entity are called recursive relationships. “ Recursive relationships are a logical concept, a concept sometimes difficult for users to grasp.

Figure 2.4 shows the addittion of a structure entity that allows a relationship between instances of EMPLOYEE. The diagram shows that the EMPLOYEE subtype of the PERSON entity has two subtypes, SERVER and MANAGER. The EMPLOYEE STRUCTURE entity represents the relationship between instances of EMPLOYEE.




Figure 2.4
Structure entity illustrates Clive Finklestein’s resolution for recursive relationship.

Naming Entities

The name assigned to an entity should be indicative of the instances of the entity. The name should be understood and accepted across the enterprise. When selecting a name, keep an enterprise view and take care to use a name that reflects how the data is used throughout the entire enterprise, not just a single area. Use names that are meaningful to the user community and domain experts.

I hope you have a set of naming conventions that were developed for use in the enterprise, or an enterprise data model, to guide you. Using naming conventions ensures that names are constructed consistently across the enterprise, regardless of who constructs the name. The following sections provide a starter set of naming conventions and give examples of good and bad names.

Entity Naming Conventions

Naming conventions might not seem important if you work in a small organization with a small set of users. However, in a large organization with many development teams and many users, naming conventions greatly facilitate communication and data sharing. As a best practice, you should develop and maintain naming conventions in a central location and then document and publish them for the whole enterprise.

I include some pointers for beginning a good set of naming conventions, just in case your organization has not yet developed one:

• An entity name should be as descriptive as necessary. Use single-word names only when the name is a widely accepted concept. Consider using noun phrases.

• An entity name should be a singular noun or noun phrase. Use PERSON instead of PERSONS or PEOPLE, or CONTAINER instead of CONTAINERS.

• An entity name should be unique. Using the same entity name to contain different data, or a different entity name to contain the same data, is needlessly confusing to developers and users alike.

• An entity name should be indicative of the data that will be contained for each instance.

• An entity name should be indicative of the data that will be contained for each instance.

• An entity name should not contain special characters (such as! @,#,$,%,^,&,*, and so on) or show possession (PERSON’S ICE CREAM).

• An entity name should not include acronyms or abbreviations unless they are part of the accepted naming conventions.

I encourage modelers to use good naming conventions if they are available and to develop them if they do not follow these guidelines.

Sunday, September 14, 2008

DATA MODELING CONCEPTS CHAPTER 1



CHAPTER 1-DATA MODELING CONCEPTS



The Role of Data Modeling



Data modeling tasks provide the most benefit when performed early in the development lifecycle. The model provides information critical to understanding the scope of a project for iterative development phases. Beginning the implementation phase without a clear understanding of the data requirements might cause your project to incur costly overruns or end up on the scrap heap.



An Introduction to Project Development



Many publications discuss project development, and this text does not cover this subject in detail. I included this section to assist modelers in understanding the role of data modeling in project development and to provide an understanding of when modeling should occur.



Most companies follow a methodology that outlines the development lifecycle selected to guide the development process. To some degree, most adhere to the same order of high-level concepts:



1. Problem definition

2. Requirements analysis

3. Conceptual design

4. Detail design

5. Implementation

6. Testing



This development method is generally referred to as the waterfall method. As you can see in Figure 1.1, each phase is completed before moving to the next, creating a “waterfall” effect.





Figure 1.1

The waterfall method of project development. Note that the results of each phase cascade into the next.



Many projects are developed using iterations or phases. An iterative development approach decreases risk by breaking the project into discrete manageable phases. Each phase includes analysis, detail design, implementation, and testing. Subsequent phases build upon and leverage the functionality of the preceding phase. However, within each phase, the waterfall method applies.



As with most engineering projects, you create a data model by following a set of steps.

1. Problem and scope definition

2. Requirements gathering

3. Analysis

4. Logical data model creation

5. Physical data model creation

6. Database creation



Figure 1.2 illustrates how each step provides input for the next.







Figure 1.2

Logical data model creation can occur prior to selecting a database platform (Oracle, DB2, Sybase, and so on). ERwin can provide support for specific physical properties if the physical data model is produced after the database platform is selected.



Problem and Scope Definition



Begin logical data modeling by defining the problem. This step is sometimes rederred to as writing a mission or scoping statement. The problem definition can be a simple pragraph or it can be a complex document that outlines a series of business objectives. The problem definition defines the scope, or boundary, of the data model, much the way a survey defines property boundaries.



Gathering Information Requirements



Most industry experts agree that the most critical task in a development project is an accurate and complete definition of the requirements. In fact, an incomplete or inaccurate understanding of requirements can cause expensive re-work and significant delay.



Gathering information requirements is the act of discovering and documenting the information necessary to identify and define the entities, attributes, and business rules for the logical model. There are two well-recognized methods for gathering requirements: facilitated sessions and interviews. Most development methodologies recommend facilitated sessions. The sections that follow provide high-level guidelines for gathering information requirements using facilitated sessions. A later exercise demonstrates how to use the information gathered to create a data model using ERwin.



Analysis



You must analyze and research the data requirements and business rules to produce a complete logical model. Analysis tasks should provide accurate and complete definitions for all entities, attributes, and relationships. Metadata, data about the data, is collected and documented during the analysis phase.



The analysis can be performed by the modeler or by a business analyst. Either the model or the business analyst works with users to document how users intend to use the data. These tasks drive out the corporate business objects needed to support the information requirements. Corporate business objects are also called code, reference, or classification data structures. This is also the opportunity to document code values that will be used. You should carefully document any derived data, data that is created by manipulating or combining one or more other data elements, and data elements used in the derivation.



Logical Data Model



A logical data model is a visual representation of data structures, data attributes, and business rules. The logical model represents data in a way that can be easily understood by business users. The logical model design should be independent of platform or implementation language requirements or how the data will be used.



The modeler uses the data requirements and the results of analysis to produce the logical data model. The modeler also resolves the logical model to third normal form and validates against the enterprise data model, if available. Later sections provide a description of a complete logical model, resolving a logical model to third normal form, an overview of an enterprise model, and provide some tips on validating a logical model against an enterprise model.



After you compare the logical model and enterprise data model and make any necessary changes, it is important to review the model for accuracy and completeness. The best practice includes a peer review as well as a review with the business partners and development team.





Entities



Entities represent the things about which the enterprise is interested in keeping data. An entity can be a tangible object such as a person or a book, but it can also be conceptual such as a cost center or business unit. Entities are nouns and are expressed in singular form, CUSTOMER as opposed to CUSTOMER, for clarity and consistency.



You should describe an entity using factual particulars that make it uniquely identifiable. Each instance of an entity must be separate and cleary identifiable from all other instances of that entity. For example, a data model to store information about customers must have a way of distinguishing one customer from another.



Figure 1.3 provides some examples of entities.









Figure 1.3

Here are examples of using ERwin to display entities in their simplest form.





Attributes



Attributes represent the data the enterprise is interested in keeping about objects. Attributes are nouns that describe the characteristics of entities.







Relationships



Relationships represent the associations between the objects about which the enterprise is interested in keeping data. A relationship is expressed as a verb or verb phrase that describes the association. Figure 1.5 provides some examples using ERwin’s Information Engineering (IE) notation to represent relationships.



Normalization



Normalization is the act of moving attributes to appropriate entities to satisfy the normal forms. Normalization is usually presented as a set of complex statements that make it seem a complicated concept. Actually, normalization is quite straightforward: “One fact in one place,” as stated by C.J.Date in his 1999 books An Introduction to Database Systems. Normalizing data means you design the data structures in such a way as to remove redundancy and limit unrelated structures.



Five normal forms are widely accepted in the industry. The forms are simply named first normal form, second normal form, third normal form, fourth normal form, and fifth normal form. In practice, many logical models are only resolved to third normal form.



Formal Definitions of Normal Forms



The following normal form definitions might seem intimidating; just consider them formulas for achieving normalization. Normal forms are based on relational algebra and should be interpreted as mathematical functions.



Business Normal Forms



In his 1992 book, Strategic Systems Development, Clive Finklestein takes a different approach to normalization. He defines business normal forms in terms of the resolution to those forms. Many modelers, myself include, find this business approach more intuitive and practical.



First business normal form (1BNF) removes repeating groups to another entity. This entity takes its name, and primary (compound) key attributes, from the original entity and forms the repeating group.



Second business normal form (2BNF) removes attributes that are partially dependent on the primary key to another entity. The primary (compound) key of this entity is the primary key of the entity in which it originally resided, together with all additional keys on which the attribute is wholly dependent.



Third business normal form (3BNF) removes attributes that are not dependent at all on the primary key to another entity where they are wholly dependent on the primary key of that entity.



Fourth business normal form (4BNF) removes attributes that are dependent on the values of the primary key or that are optional to a secondary entity where they wholly depend on the value of the primary key or where they must (it is mandatory) exist in that entity.



Fifth business normal form (BNF) exists as a structure entity if recursive or other associations exist between occurrences of secondary entities or if recursive associations exist between occurrences of their principal entity.



A Complete Logical Data Model



A complete logical model should be in third business normal form and include all entities, attributes, and relational ships required to support the data requirements and the business rules associated with the data.





Data warehouse Concepts

Central Data warehouse

A Central Data Warehouse is a repository of company data where a database is created from operational data extracts. This database adheres to a single, consistent enterprise data model to ensure consistency in decision making support across the company.

Active Metadata Warehouse

An Active Metadata Warehouse is a repository of Metadata to help speed up data reporting and analyses from an active data warehouse. In its most simple definition, a Metadata is data describing data.

Enterprise Data Warehouse:

Enterprise Data Warehouse is a centralized warehouse which provides service for the entire enterprise. A data warehouse is by essence a large repository of historical and current transaction data of an organization. An Enterprise Data Warehouse is a specialized data warehouse which may have several interpretations

Functional Data warehouse

The Enterprise Data Warehouse is broken down into Functional Data Warehouse based one functional Data. Depending on the size of the company and their financial capability, a Functional Data Warehouse may serve on department or may server more. There are also companies that have branches in many different geographic locations around the globe and their Enterprise Data Warehouse may set up differently with different clustering for Functional Data Warehouses.

Operational Data Store

An Operational Data Store (ODS) is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data.

Friday, September 12, 2008

ODS Vs. Data Warehouse

ODS Vs. Data Warehouse

Operational Data store

- Data Focused Integration From Transaction Processing Focused Systems

- Age Of The Data is Current, Near Term (Today, Last Week’s)

- Primary Use: Day-To-Day Decisions Tactical Reporting Current Operational
Results
- Frequency Of Load: Twice Daily , Daily, Weekly

Datawarehouse

- Subject Oriented,Integrated,Non-Volatile,Time Variant

- Age of the Data is Historic (Last Month, Qtrly, Five Years)

- Primary Use: Long-Term Decisions Strategic Reporting Trend Detection

- Frequency of load : Weekly, Monthly, Quarterly

Tuesday, September 2, 2008

Incompatible objects

Incompatible objects nothing but Sometimes it is not possible to use certain combinations of objects in data providers. This situation arises when objects bear no relationship to one another. These objects are called incompatible objects.

For example, the Island Resorts Marketing universe contains the
[Reservation Year] and [Revenue] objects, which are incompatible. This is
because there is no revenue associated with a reservation. Revenue is
generated only when the customer is invoiced. The underlying database
structure reflects this; you cannot build a query that aggregates revenue by
reservation year because there is no such thing as revenue by reservation
year. In other words, the aggregation context that you specified for the
[Revenue] object does not exist.

What is a scope of analysis?

The scope of analysis for a query is extra data that you can retrieve from the
database to give more details on the data returned by each of the objects in a
query. This extra data does not appear in the initial result document, but it
remains available in the data cube, so you can pull this data in to the report to
allow you to access more detail at any time. This process of refining the data
to lower levels of detail is called drilling down on an object

In the universe, the scope of analysis corresponds to the hierachical levels
below the object selected for a query.

You can set this level when you build a query.

You can also create a custom scope of analysis by selecting specific
dimensions to the Scope of Analysis pane.

For example, a scope of analysis of one
level down for the object Year, would include the object Quarter, which
appears immediately under Year.