Relational Databases: Lecture Notes

Terms

  • SQL – query language

  • row == tuple

  • column == attribute or field

  • table == relation

  • declarative language

  • data definition language

  • data manipulation language

Relational databases were introduced in the 1970s… why has the idea survived so long? query optimization!

We will be using sqlite3, a light weight relational database implementation.

We will be working with a database that has information about contributions to candidates for federal office.

Commands that start with . contain instructions to sqlite3, rather than queries, etc.

$ run sqlite3 fec.sqlite3 < create-fec.sql to create the fec database.
$ sqlite3 fec.sqlite3

sqlite> .tables
candidate     contribution

.mode column   /* line up the columns in output */
.header ON     /* output headers with results */

Schema:

sqlite> .schema candidate
CREATE TABLE candidate
  (committee_id varchar(10),
   lastname varchar(50),
   firstname varchar(50),
   party varchar(6),
   city varchar(20),
   state varchar(2),
   zip integer,
   cand_id varchar(10),
   district varchar(2),
   constraint pk_candidate primary key (committee_id));

The data has already been loaded by the script that created the tables. Here’s how you would load the data in bulk, if necessary:

sqlite> .separator ","
sqlite> .import candidate.csv candidate

You can type in queries from the command-line:

sqlite> SELECT * FROM candidate WHERE party = "REP" LIMIT 10;

or read queries from a file:

sqlite> .read q1.sql

where the file q1.sql contains:

-- Pandas: df[df.party == "REP"]

SELECT * FROM candidate WHERE party = "REP" LIMIT 10;

Please note the line that starts with -- is a comment and is ignored by the query engine.

SQL Basics

Selection operation

Selection: use a boolean expression to describe which tuples should be included in the result. For example, list all candidates FROM a specific party:

SELECT * FROM candidate WHERE party = "REP";

Use collate to handle differences in case: rep, Rep, REP:

SELECT * FROM candidate WHERE party = "rep" COLLATE NOCASE;

In the case of the candidate data, all of the party names are in all caps, so the collate is unnecessary.

Note the preferred style is to use all CAPS for keywords, but it is not required.

Projection

Projection is used to describe which attributes should be included in the result. For example, list the names of all the candidates:

SELECT lastname, firstname FROM candidate;

Combine ideas from selection and projection to find the names of all the candidates in the Republican party:

SELECT lastname, firstname FROM candidate WHERE party = "REP";

Use limit to reduce the number that are included in the result:

SELECT lastname, firstname FROM candidate WHERE party = "REP" LIMIT 10;

Aggregation

The simplest form of aggregation is counting rows:

sqlite> SELECT count(*) FROM candidate WHERE party = "REP";
count(*)
----------
2367

Use the AS operator to name an attribute:

sqlite> SELECT count(*) AS cnt FROM candidate WHERE party = "REP";
cnt
-------
2367

Group by provides another way to do aggregation: group the data using some criteria and then aggregate over the groups. For example, we might want to compute a table with the number of candidates per party:

SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party;

We can sort the result, in increasing order by default:

SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party
ORDER BY cnt;

Notice the use of AS to give a name to the count and the use of that name in the ORDER BY clause.

To sort in decreasing ORDER BY number of candidates add DESC:

SELECT party, count(*) AS cnt FROM candidate
GROUP BY party
ORDER BY cnt DESC;

To find the top 10 parties by candidate, add in a use of LIMIT:

SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party
ORDER BY cnt DESC
LIMIT 10;

Finally, use HAVING to select which groups/aggregate results to include in the result. For example: find parties that have at least 100 candidates:

SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party
HAVING cnt > 100
ORDER BY cnt;

The where clause specifies the rows to include BEFORE grouping. Having specifies which results to include AFTER grouping/aggregation.

A few extra examples

What was the total number of contributions made by residents of each state?

SELECT state, count(*) AS total
FROM contribution
GROUP BY state
ORDER BY state;

What is the smallest positive contribution in each state:

SELECT state, min(amount) AS total
FROM contribution
WHERE amount > -1
GROUP BY state
ORDER BY state;

Joins

Joins allow us to combine information from different tables or even using the same table (self-join).

Here is information about employees in a tiny firm. Each row contains an employee number and a pay per period:

001,1850
003,2000
004,1975
005,2450

Here’s the schema for this data:

CREATE TABLE employee_records
  (employee_number varchar(3),
   amount integer);

The employee number serves as a primary key for the employee table: that is, each employee number uniquely identifies a tuple.

Here’s some sample payroll data from a tiny company:

C1,001,1850
C2,002,2200
C3,003,1000
C4,003,1000

The first column contains a check number, the second is an employee number, and the third is an amount. We can create this table using the following schema:

CREATE TABLE payroll_ledger
  (check_number varchar(10),
   employee_number varchar(3),
   pay_per_period integer);

The check number serves the primary key for this table.

Let’s combine these two tables using a JOIN without specifying a join condition:

sqlite> SELECT * FROM payroll_ledger JOIN employee_records;
C1,001,1850,001,1850
C1,001,1850,003,2000
C1,001,1850,004,1975
C1,001,1850,005,2450
C2,002,2200,001,1850
C2,002,2200,003,2000
C2,002,2200,004,1975
C2,002,2200,005,2450
C3,003,1000,001,1850
C3,003,1000,003,2000
C3,003,1000,004,1975
C3,003,1000,005,2450
C4,003,1000,001,1850
C4,003,1000,003,2000
C4,003,1000,004,1975
C4,003,1000,005,2450

The result is the cross-product: every row in payroll_ledger is matched with every row in employee_record. This result is not all that useful. Let’s add a JOIN condition to keep only the rows that contain payroll and employee information for the same employee (as specified by the employee number):

sqlite> SELECT * FROM
   ...> payroll_ledger JOIN employee_records
   ...> ON payroll_ledger.employee_number = employee_records.employee_number;
C1,001,1850,001,1850
C3,003,1000,003,2000
C4,003,1000,003,2000

Notice that the result is much smaller. Also, notice that there is no entry for check C2. Check C2 references employee number 002, who does not appear in the employee records table. Also, notice employees 004 and 005 appear in the employee records table, but not in the payroll ledger. As a result, the join does not include entries for them.

We can use a LEFT JOIN to make sure that row that appears in table on the left of the JOIN operator appears in the result:

sqlite> SELECT * FROM
   ...> payroll_ledger LEFT JOIN employee_records
   ...> ON payroll_ledger.employee_number = employee_records.employee_number;
C1,001,1850,001,1850
C2,002,2200,,
C3,003,1000,003,2000
C4,003,1000,003,2000

Notice that check C2 appears in the result with nulls for the employee record values. Sqlite3 does not support RIGHT JOIN, which includes all rows from the right operand of the JOIN at least once or OUTER JOIN, which includes all rows from both operands at least once.

Here are some sample queries that use JOIN on the candidate and contribution data.

Combine data from candidate and contribution tables using the candidate ID as the join key:

SELECT *
FROM candidate JOIN contribution
ON candidate.cand_id = contribution.cand_id
LIMIT 10;

Same as the previous example, except use projection to get rid of duplicate information:

SELECT cand.lastname, cand.firstname, cand.state AS cand_state,
       cont.state AS cont_state, cont.amount
FROM candidate AS cand JOIN contribution AS cont
ON cand.cand_id = cont.cand_id
LIMIT 10;

Add in grouping: find the top 3 candidates in terms of total contributions. Exclude negative amounts.

SELECT cand.cand_id, cand.lastname, cand.firstname, SUM(cont.amount) as total
FROM candidate AS cand JOIN contribution AS cont
ON cand.cand_id = cont.cand_id
WHERE (cont.amount > 0) and (cand.state = cont.state)
GROUP BY cand.cand_id
ORDER BY total DESC
LIMIT 3;

Nested queries

Queries can be nested in a variety of ways. Here’s a simple example: how many candidates received at least one positive contribution?

SELECT COUNT(*)
FROM candidate
WHERE candidate.cand_id IN (SELECT DISTINCT cand_id FROM contribution WHERE amount > 0);

Here’s a sightly more complex query: list candidates who had at least one contribution of more than three times the average contribution:

SELECT DISTINCT lastname, firstname
FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id
WHERE amount > 3 * (SELECT AVG(amount) FROM contribution);

DISTINCT eliminates duplicates from the result.

The CASE clause allows us to include different values for an attribute based on a boolean condition. The general form is:

CASE WHEN [condition] THEN [expression1] ELSE [expression2] END

The attribute will have the value of expression1 if the condition is true and the value of expression2 otherwise. Here is an example query that yields a table that includes columns for lastname, homestate, candidate id, an in-state contribution amount, and out of state contribution amount (note that only one of these two amounts will be non-zero):

SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id,
        CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc,
        CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc
FROM candidate INNER JOIN contribution
ON candidate.cand_id = contribution.cand_id;

This query can be nested within another to produce a table of the total amount of in-state contributions to a candidate and the total amount of out-of-state contributions:

SELECT lastname, homestate, SUM(isc), SUM(oosc)
FROM (SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id,
        CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc,
        CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc
      FROM candidate INNER JOIN contribution ON candidate.cand_id = contribution.cand_id)
GROUP BY cand_id;

Inserting/updating/deleting rows

Here’s the general form for inserting rows:

INSERT INTO <table>
VALUES (X,Y,Z)

Here’s an example:

INSERT INTO candidate
VALUES ("HHHH", "DUPONT", "TODD", "LIB", "CHICAGO", "IL", "60637", "CCCCC", "01");

This form requires that we supply a value for every attribute in the schema.

Here’s a different way to insert rows:

INSERT INTO <table> (<col name>, <col name>, <col name>)
VALUES (X,Y,Z)

and here’s a sample use:

INSERT INTO candidate (lastname, firstname, party, city, state, zip, cand_id, district)
VALUES ("FOO", "BAR", "DEM", "CHICAGO", "IL", "60637", "CCCCC", "01");

This form does not require a value for every attribute. Notice that we did not supply a value for the committee ID, for example. Values that are not supplied will be set to NULL.

Here’s the general form for updating rows:

UPDATE <table>
SET <col name> = <expr>, <col name> = <expr>, ...
WHERE <expr>

and here’s an example:

UPDATE candidate
SET committee_id = "H55555"
WHERE lastname="DUPONT";

Note that every row that matches the where clause will be updated!

Here’s the general form for deleting rows:

DELETE FROM <table>
WHERE <expr>

and here’s an example:

DELETE FROM candidate
WHERE lastname="DUPONT";

Again, every row that matches the where clause will be deleted!