Relational Databases: Lecture Notes¶
Terms¶
SQL – query language
row == tuple
column == attribute or field
table == relation
declarative language
data definition language
data manipulation language
Relational databases were introduced in the 1970s… why has the idea survived so long? query optimization!
We will be using sqlite3
, a light weight relational database
implementation.
We will be working with a database that has information about contributions to candidates for federal office.
Commands that start with . contain instructions to sqlite3, rather than queries, etc.
$ run sqlite3 fec.sqlite3 < create-fec.sql to create the fec database.
$ sqlite3 fec.sqlite3
sqlite> .tables
candidate contribution
.mode column /* line up the columns in output */
.header ON /* output headers with results */
Schema:
sqlite> .schema candidate
CREATE TABLE candidate
(committee_id varchar(10),
lastname varchar(50),
firstname varchar(50),
party varchar(6),
city varchar(20),
state varchar(2),
zip integer,
cand_id varchar(10),
district varchar(2),
constraint pk_candidate primary key (committee_id));
The data has already been loaded by the script that created the tables. Here’s how you would load the data in bulk, if necessary:
sqlite> .separator ","
sqlite> .import candidate.csv candidate
You can type in queries from the command-line:
sqlite> SELECT * FROM candidate WHERE party = "REP" LIMIT 10;
or read queries from a file:
sqlite> .read q1.sql
where the file q1.sql
contains:
-- Pandas: df[df.party == "REP"]
SELECT * FROM candidate WHERE party = "REP" LIMIT 10;
Please note the line that starts with --
is a comment and is
ignored by the query engine.
SQL Basics¶
Selection operation¶
Selection: use a boolean expression to describe which tuples should be included in the result. For example, list all candidates FROM a specific party:
SELECT * FROM candidate WHERE party = "REP";
Use collate to handle differences in case: rep, Rep, REP:
SELECT * FROM candidate WHERE party = "rep" COLLATE NOCASE;
In the case of the candidate data, all of the party names are in all caps, so the collate is unnecessary.
Note the preferred style is to use all CAPS for keywords, but it is not required.
Projection¶
Projection is used to describe which attributes should be included in the result. For example, list the names of all the candidates:
SELECT lastname, firstname FROM candidate;
Combine ideas from selection and projection to find the names of all the candidates in the Republican party:
SELECT lastname, firstname FROM candidate WHERE party = "REP";
Use limit to reduce the number that are included in the result:
SELECT lastname, firstname FROM candidate WHERE party = "REP" LIMIT 10;
Aggregation¶
The simplest form of aggregation is counting rows:
sqlite> SELECT count(*) FROM candidate WHERE party = "REP";
count(*)
----------
2367
Use the AS operator to name an attribute:
sqlite> SELECT count(*) AS cnt FROM candidate WHERE party = "REP";
cnt
-------
2367
Group by provides another way to do aggregation: group the data using some criteria and then aggregate over the groups. For example, we might want to compute a table with the number of candidates per party:
SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party;
We can sort the result, in increasing order by default:
SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party
ORDER BY cnt;
Notice the use of AS to give a name to the count and the use of that name in the ORDER BY clause.
To sort in decreasing ORDER BY number of candidates add DESC:
SELECT party, count(*) AS cnt FROM candidate
GROUP BY party
ORDER BY cnt DESC;
To find the top 10 parties by candidate, add in a use of LIMIT:
SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party
ORDER BY cnt DESC
LIMIT 10;
Finally, use HAVING to select which groups/aggregate results to include in the result. For example: find parties that have at least 100 candidates:
SELECT party, count(*) AS cnt
FROM candidate
GROUP BY party
HAVING cnt > 100
ORDER BY cnt;
The where clause specifies the rows to include BEFORE grouping. Having specifies which results to include AFTER grouping/aggregation.
A few extra examples¶
What was the total number of contributions made by residents of each state?
SELECT state, count(*) AS total
FROM contribution
GROUP BY state
ORDER BY state;
What is the smallest positive contribution in each state:
SELECT state, min(amount) AS total
FROM contribution
WHERE amount > -1
GROUP BY state
ORDER BY state;
Joins¶
Joins allow us to combine information from different tables or even using the same table (self-join).
Here is information about employees in a tiny firm. Each row contains an employee number and a pay per period:
001,1850
003,2000
004,1975
005,2450
Here’s the schema for this data:
CREATE TABLE employee_records
(employee_number varchar(3),
amount integer);
The employee number serves as a primary key for the employee table: that is, each employee number uniquely identifies a tuple.
Here’s some sample payroll data from a tiny company:
C1,001,1850
C2,002,2200
C3,003,1000
C4,003,1000
The first column contains a check number, the second is an employee number, and the third is an amount. We can create this table using the following schema:
CREATE TABLE payroll_ledger
(check_number varchar(10),
employee_number varchar(3),
pay_per_period integer);
The check number serves the primary key for this table.
Let’s combine these two tables using a JOIN without specifying a join condition:
sqlite> SELECT * FROM payroll_ledger JOIN employee_records;
C1,001,1850,001,1850
C1,001,1850,003,2000
C1,001,1850,004,1975
C1,001,1850,005,2450
C2,002,2200,001,1850
C2,002,2200,003,2000
C2,002,2200,004,1975
C2,002,2200,005,2450
C3,003,1000,001,1850
C3,003,1000,003,2000
C3,003,1000,004,1975
C3,003,1000,005,2450
C4,003,1000,001,1850
C4,003,1000,003,2000
C4,003,1000,004,1975
C4,003,1000,005,2450
The result is the cross-product: every row in payroll_ledger is matched with every row in employee_record. This result is not all that useful. Let’s add a JOIN condition to keep only the rows that contain payroll and employee information for the same employee (as specified by the employee number):
sqlite> SELECT * FROM
...> payroll_ledger JOIN employee_records
...> ON payroll_ledger.employee_number = employee_records.employee_number;
C1,001,1850,001,1850
C3,003,1000,003,2000
C4,003,1000,003,2000
Notice that the result is much smaller. Also, notice that there is no
entry for check C2
. Check C2
references employee number
002
, who does not appear in the employee records table. Also,
notice employees 004
and 005
appear in the employee records
table, but not in the payroll ledger. As a result, the join does not
include entries for them.
We can use a LEFT JOIN to make sure that row that appears in table on the left of the JOIN operator appears in the result:
sqlite> SELECT * FROM
...> payroll_ledger LEFT JOIN employee_records
...> ON payroll_ledger.employee_number = employee_records.employee_number;
C1,001,1850,001,1850
C2,002,2200,,
C3,003,1000,003,2000
C4,003,1000,003,2000
Notice that check C2
appears in the result with nulls for the
employee record values. Sqlite3 does not support RIGHT JOIN, which
includes all rows from the right operand of the JOIN at least once or
OUTER JOIN, which includes all rows from both operands at least once.
Here are some sample queries that use JOIN on the candidate and contribution data.
Combine data from candidate and contribution tables using the candidate ID as the join key:
SELECT *
FROM candidate JOIN contribution
ON candidate.cand_id = contribution.cand_id
LIMIT 10;
Same as the previous example, except use projection to get rid of duplicate information:
SELECT cand.lastname, cand.firstname, cand.state AS cand_state,
cont.state AS cont_state, cont.amount
FROM candidate AS cand JOIN contribution AS cont
ON cand.cand_id = cont.cand_id
LIMIT 10;
Add in grouping: find the top 3 candidates in terms of total contributions. Exclude negative amounts.
SELECT cand.cand_id, cand.lastname, cand.firstname, SUM(cont.amount) as total
FROM candidate AS cand JOIN contribution AS cont
ON cand.cand_id = cont.cand_id
WHERE (cont.amount > 0) and (cand.state = cont.state)
GROUP BY cand.cand_id
ORDER BY total DESC
LIMIT 3;
Nested queries¶
Queries can be nested in a variety of ways. Here’s a simple example: how many candidates received at least one positive contribution?
SELECT COUNT(*)
FROM candidate
WHERE candidate.cand_id IN (SELECT DISTINCT cand_id FROM contribution WHERE amount > 0);
Here’s a sightly more complex query: list candidates who had at least one contribution of more than three times the average contribution:
SELECT DISTINCT lastname, firstname
FROM candidate JOIN contribution ON candidate.cand_id = contribution.cand_id
WHERE amount > 3 * (SELECT AVG(amount) FROM contribution);
DISTINCT eliminates duplicates from the result.
The CASE clause allows us to include different values for an attribute based on a boolean condition. The general form is:
CASE WHEN [condition] THEN [expression1] ELSE [expression2] END
The attribute will have the value of expression1 if the condition is true and the value of expression2 otherwise. Here is an example query that yields a table that includes columns for lastname, homestate, candidate id, an in-state contribution amount, and out of state contribution amount (note that only one of these two amounts will be non-zero):
SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id,
CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc,
CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc
FROM candidate INNER JOIN contribution
ON candidate.cand_id = contribution.cand_id;
This query can be nested within another to produce a table of the total amount of in-state contributions to a candidate and the total amount of out-of-state contributions:
SELECT lastname, homestate, SUM(isc), SUM(oosc)
FROM (SELECT lastname, candidate.state AS homestate, candidate.cand_id AS cand_id,
CASE WHEN candidate.state= contribution.state THEN amount else 0 END AS isc,
CASE WHEN candidate.state != contribution.state THEN amount else 0 end AS oosc
FROM candidate INNER JOIN contribution ON candidate.cand_id = contribution.cand_id)
GROUP BY cand_id;
Inserting/updating/deleting rows¶
Here’s the general form for inserting rows:
INSERT INTO <table>
VALUES (X,Y,Z)
Here’s an example:
INSERT INTO candidate
VALUES ("HHHH", "DUPONT", "TODD", "LIB", "CHICAGO", "IL", "60637", "CCCCC", "01");
This form requires that we supply a value for every attribute in the schema.
Here’s a different way to insert rows:
INSERT INTO <table> (<col name>, <col name>, <col name>)
VALUES (X,Y,Z)
and here’s a sample use:
INSERT INTO candidate (lastname, firstname, party, city, state, zip, cand_id, district)
VALUES ("FOO", "BAR", "DEM", "CHICAGO", "IL", "60637", "CCCCC", "01");
This form does not require a value for every attribute. Notice that we did not supply a value for the committee ID, for example. Values that are not supplied will be set to NULL.
Here’s the general form for updating rows:
UPDATE <table>
SET <col name> = <expr>, <col name> = <expr>, ...
WHERE <expr>
and here’s an example:
UPDATE candidate
SET committee_id = "H55555"
WHERE lastname="DUPONT";
Note that every row that matches the where clause will be updated!
Here’s the general form for deleting rows:
DELETE FROM <table>
WHERE <expr>
and here’s an example:
DELETE FROM candidate
WHERE lastname="DUPONT";
Again, every row that matches the where clause will be deleted!