A relational database store data in a series of tables so that the data models a mathematical theory of relations. The model allows for queries based on projection, selection and join, among other operations, and connect the data in the tables by way of keys. The queries are expressed in a standard syntax called SQL, the Standard Query Language, which is common to all various vendors of relational databases.
The theory of relations states that data is arranged as various sets of tuples, called relations, where a tuple is collection of values for attributes. A relation states which attributes it collects. Concretely speaking, the attributes are the columns of a table, and the tuples are rows in the table. Constraints among the attributes will allow only certain tuples to be valid members of the relation, and the database should not allow rows into be inserted in the table if they would violate the constraints.
For instance, the mathematical theory says that if two tuples agree in the value of all attributes, they are the same tuple. In a table, it is possible for two distinct rows to contain the same data for all columns. However, the database should prevent this from happening because that would not be consistent with the mathematical model.
The description of the relations, including the attributes for each relation, and the governing constraints, are the database schema. From the schema we can determine for each table which attribute collections are keys. Loosely speaking, a key is a collection of attributes whose values will uniquely determine a tuple in the relation. For example, in a database of students, the student ID number will serve as a key. The first name, if that is an attribute in the relation, would not serve as a key, since many students could have the first name.
Typically there will be a single column in the table containing a unique integer index from each row of the table, and these will be the primary key for the table. But the definition for key is more abstract, and this definition must be used in order to understand normal forms.
There are a sequence of normal forms determined by the schema of the databse. It is cosidered A Good Thing if a database ranks high in this sequence of normal forms.
First normal form requires that attributes are atomic. There should be no internal fields in the attribute, or if there are, the relations and schemas don't care about it. Typical examples of a violation of first normal form would be a relation which relates a parent attribute with a list-of-children attribute. If the intention is to query of the individual children of the list-of-children, this schema would violate first normal form, as the attribute list-of-children has internal structure of interest to the database. The solution would be to create a new table with two colums: parent and child, and each parent-child relationship would be an individual entry in the table.
Second normal form requires that the schema be in first normal form and that any non-prime attribute depends on the entirety of any candidate key, not on just some attributes of the candidate key. This must be ascertained (by looking at the constraints declared in the schema) for all candidate keys, not just the primary key, or for just candidate keys of interest. An example of a violation of second normal form would be a relation which relates parent with child with gender-of-parent. The pair of attributes (parent,child) is a candidate key, as it determines uniquely the row of the table (a parent can't have two genders). The entire three columns is a superkey, but any other combination is not a key, and therefore gender-of-parent is non-prime. However gender-of-parent does not depend on the entirety of the candidate key (parent, child), on the parent attribute within the key. Therefore the relation is not in second normal form. To normalize the table, remove gender-of-parent from the relation and create a new relation storing the parent, gender-of-parent pair.
Third normal form requires that the schema be in second normal form and that there are no functional dependencies among any of the non-prime attributes of the relation. An example of a violation of third normal form would be a relation which relates parent, gender-of-parent, and parent-type, where parent-type has value either father or mother. gender-or-parent and parent-type are both non-prime, and one attribute determines the other. To normalize the table, remove parent-type and create a new relation for the attribute pair gender-of-parent and parent-type. This relation will contain exactly two elements, (male, father) and (female, mother), and so in practice this table would not be necessary.
Finally, Boyce-Codd normal form is a third normal form which additionally requires that any functional dependency be a super key. Given that we have third normal form, the case not yet considered is a functional dependency depending on a mix of prime and non-prime attributes. Boyce-Codd requires that enough prime attributes are required so that the mix contain some candidate key. Also, Boyce-Codd does not limit consideration of the target to just non-prime attributes. (But I'm not sure if this is significant.)
Queries on the relational database are done through the operations of projection, selection and join. In the implemented database, these are expressed in the SQL language using the SELECT statement. We will talk about these operations abstractly, in terms of the mathematical model, and also provide sample syntax for SQL.
Though diverse, all these operations are accomlished in SQL by the SELECT statement. It is of the form:
SELECT attributes-with-commas FROM tables-with-commas -> WHERE conditions-on-attributes ;
Projection is accomplished by naming the attributes of interest in the attributes-with-commas part of the selection. Selection is accomplished by writing the selection criteria in the conditions-on-attributes following the keyword WHERE. The join is accomplished in the FROM part of the statement, either using the keywork JOIN or simply indicating all tables which are involved in the join.
A formal model for relational databases (as well as a strong criticism of SQL!) is given in the paper The third manifesto by Darwen and Date. I add some notes here for the curious.
The theory is pretty close to the reality of tables. An attribute is a name. Associated with each attribute is a domain, which is the same as a data type. A header is a collection of attributes with their associated domains. A header ressembles a structure defnition in programming, giving field names and the type of each field. A tuple for a header is a set of values, one for each attribute in the header, and of the specified type for that attribute. This would be a row in a table or an instance of a structure in programming languages.
A relation has a head and a body. The head is a header and the body is a collection of tuples appropriate for the header. Tuples cannot be repeated, nor are they ordered. In that sense the body is a set of tuples. The relation then models a table at a certain point in time (except the Third Manifesto did not rule out infinite tables, but we won't take this too seriously). Finally a relation variable (relvar) is a variable whose domain is relations of a given header. A relvar corresponds to any possible relation.
The concept of key is then relative to the relvar domain. That is, a candidate key is a subset of the attributes of the header which unique identifies a tuple in a relation for any value of the relvar.
Consider a database operation of deducting an object from stock. For the quantity must be tested, then decremented. Suppose in our web application, two users, connecting in separate with two browsers, need to deduct from stock the same item. It is possible that they both arrive at the database at the same time, and both deduct the item.
SQL databases provide several solutions. The simplest is provided by the database guarantee that update operations are atomic and serialized. If several update operations are commanded at the same time, the result in the database will be consistent with choosing and order of the operations, and running them one after the other, each operation completeing before the other beginning.
If table I has stores the Inventory, and the book ID B has stock quantity Q, the required update is:
update table I field Q=Q-1 where B=book_id and Q>0
Other situations might require more complicated techniques. Tables can be locked, so that only one process can operate on the table at a time. Locks come in two forms, write are read. A write lock allows for any read, but if a table is write locked only the lock owner can write. A read lock prevents any access, read or write, by any but the lock owner.
Locks have a very definite problem: if two processes compete for a set of locks, it is possible that each of the processes claim some of the locks mutually block each other from proceeding. This is called deadlock. The lock operation of SQL can prevent deadlock by guarenteeing that locks are taken in a standard order, so that there cannot be a circular dependency of locks which cause deadlock.
As a final technique, SQL provides for transactions. These wrap multiple operations into a single, all or nothing package.
We present a database schema for a book store.