Lecture 17
Structures Query Language is a special purpose language for interacting with (querying and modifying) indexed tabular data.
ANSI Standard but with dialect divergence (MySql, Postgres, SQLite, etc.)
This functionality maps very closely (but not exactly) with the data manipulation verbs present in dplyr.
SQL is likely to be a foundational skill if you go into industry - learn it and put it on your CV
The following is specific to SQLite
We can make this table output a little nicer with some additonal SQLite options:
sqlite> .mode column
sqlite> .headers on
sqlite> SELECT * FROM employees;
## name email salary dept
## ---------- ----------------- ---------- ----------
## Alice alice@company.com 52000.0 Accounting
## Bob bob@company.com 40000.0 Accounting
## Carol carol@company.com 30000.0 Sales
## Dave dave@company.com 33000.0 Accounting
## Eve eve@company.com 44000.0 Sales
## Frank frank@comany.com 37000.0 Sales
We can subset for certain columns (and rename them) using SELECT
We can sort our results by adding ORDER BY
to our SELECT
statement
We can filter rows by adding WHERE
to our statements
We can create groups for the purpose of summarizing using GROUP BY
. As with dplyr it is not terribly useful by itself.
We can limit the number of rows we get by using LIMIT
and order results with ORDER BY
with or without DESC
Using sqlite calculate the following quantities,
The total costs in payroll for this company
The average salary within each department
By default SQLite uses a CROSS JOIN
which is not terribly useful most of the time (similar to R’s expand.grid()
)
sqlite> SELECT * FROM employees JOIN phone;
## name email salary dept name phone
## ---------- ----------------- ---------- ---------- ---------- ------------
## Alice alice@company.com 52000.0 Accounting Bob 919 555-1111
## Alice alice@company.com 52000.0 Accounting Carol 919 555-2222
## Alice alice@company.com 52000.0 Accounting Eve 919 555-3333
## Alice alice@company.com 52000.0 Accounting Frank 919 555-4444
## Bob bob@company.com 40000.0 Accounting Bob 919 555-1111
## Bob bob@company.com 40000.0 Accounting Carol 919 555-2222
## Bob bob@company.com 40000.0 Accounting Eve 919 555-3333
## Bob bob@company.com 40000.0 Accounting Frank 919 555-4444
## Carol carol@company.com 30000.0 Sales Bob 919 555-1111
## Carol carol@company.com 30000.0 Sales Carol 919 555-2222
## Carol carol@company.com 30000.0 Sales Eve 919 555-3333
## Carol carol@company.com 30000.0 Sales Frank 919 555-4444
## Dave dave@company.com 33000.0 Accounting Bob 919 555-1111
## Dave dave@company.com 33000.0 Accounting Carol 919 555-2222
## Dave dave@company.com 33000.0 Accounting Eve 919 555-3333
## Dave dave@company.com 33000.0 Accounting Frank 919 555-4444
## Eve eve@company.com 44000.0 Sales Bob 919 555-1111
## Eve eve@company.com 44000.0 Sales Carol 919 555-2222
## Eve eve@company.com 44000.0 Sales Eve 919 555-3333
## Eve eve@company.com 44000.0 Sales Frank 919 555-4444
## Frank frank@comany.com 37000.0 Sales Bob 919 555-1111
## Frank frank@comany.com 37000.0 Sales Carol 919 555-2222
## Frank frank@comany.com 37000.0 Sales Eve 919 555-3333
## Frank frank@comany.com 37000.0 Sales Frank 919 555-4444
If you want SQLite to find the columns to merge on automatically then we prefix the join with NATURAL
.
sqlite> SELECT * FROM employees NATURAL JOIN phone;
## name email salary dept phone
## ---------- --------------- ---------- ---------- ------------
## Bob bob@company.com 40000.0 Accounting 919 555-1111
## Carol carol@company.c 30000.0 Sales 919 555-2222
## Eve eve@company.com 44000.0 Sales 919 555-3333
## Frank frank@comany.co 37000.0 Sales 919 555-4444
sqlite> SELECT * FROM employees JOIN phone ON employees.name = phone.name;
## name email salary dept name phone
## ---------- --------------- ---------- ---------- ---------- ------------
## Bob bob@company.com 40000.0 Accounting Bob 919 555-1111
## Carol carol@company.c 30000.0 Sales Carol 919 555-2222
## Eve eve@company.com 44000.0 Sales Eve 919 555-3333
## Frank frank@comany.co 37000.0 Sales Frank 919 555-4444
to avoid the duplicate name
column we can use USING
instead of ON
sqlite> SELECT * FROM employees JOIN phone USING(name);
## name email salary dept phone
## ----- ----------------- ------- ---------- ------------
## Bob bob@company.com 40000.0 Accounting 919 555-1111
## Carol carol@company.com 30000.0 Sales 919 555-2222
## Eve eve@company.com 44000.0 Sales 919 555-3333
## Frank frank@comany.com 37000.0 Sales 919 555-4444
As a rule, the USING
(or NATURAL
) clause is used if the column names match between tables, otherwise ON
is needed.
sqlite> SELECT * FROM employees NATURAL LEFT JOIN phone;
## name email salary dept phone
## ---------- ----------------- ---------- ---------- ----------
## Alice alice@company.com 52000.0 Accounting
## Bob bob@company.com 40000.0 Accounting 919 555-11
## Carol carol@company.com 30000.0 Sales 919 555-22
## Dave dave@company.com 33000.0 Accounting
## Eve eve@company.com 44000.0 Sales 919 555-33
## Frank frank@comany.com 37000.0 Sales 919 555-44
sqlite> SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
## name email salary dept name phone
## ---------- ----------------- ---------- ---------- ---------- ----------
## Alice alice@company.com 52000.0 Accounting
## Bob bob@company.com 40000.0 Accounting Bob 919 555-11
## Carol carol@company.com 30000.0 Sales Carol 919 555-22
## Dave dave@company.com 33000.0 Accounting
## Eve eve@company.com 44000.0 Sales Eve 919 555-33
## Frank frank@comany.com 37000.0 Sales Frank 919 555-44
As above to avoid the duplicate name
column we can use USING
, or can be more selective about our returned columns,
sqlite> SELECT employees.*, phone FROM employees LEFT JOIN phone ON employees.name = phone.name;
## name email salary dept phone
## ----- ----------------- ------- ---------- ------------
## Alice alice@company.com 52000.0 Accounting
## Bob bob@company.com 40000.0 Accounting 919 555-1111
## Carol carol@company.com 30000.0 Sales 919 555-2222
## Dave dave@company.com 33000.0 Accounting
## Eve eve@company.com 44000.0 Sales 919 555-3333
## Frank frank@comany.com 37000.0 Sales 919 555-4444
Note that SQLite does not support directly support an OUTER JOIN
(e.g a full join in dplyr) or a RIGHT JOIN
.
A RIGHT JOIN
can be achieved by swapping the two tables (i.e. A right join B is equivalent to B left join A)
An OUTER JOIN
can be achieved via using UNION ALL
with both left joins (A on B and B on A)
We can nest tables within tables for the purpose of queries.
SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NULL;
## name email salary dept phone
## ---------- ----------------- ---------- ---------- ----------
## Alice alice@company.com 52000.0 Accounting
## Dave dave@company.com 33000.0 Accounting
sqlite> SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NOT NULL;
## name email salary dept phone
## ---------- --------------- ---------- ---------- ------------
## Bob bob@company.com 40000.0 Accounting 919 555-1111
## Carol carol@company.c 30000.0 Sales 919 555-2222
## Eve eve@company.com 44000.0 Sales 919 555-3333
## Frank frank@comany.co 37000.0 Sales 919 555-4444
Lets try to create a table that has a new column - abv_avg
which contains how much more (or less) than the average, for their department, each person is paid.
Hint - This will require joining a subquery.
employees.sqlite
is available in the exercises repo.
To give us a bit more variety, we have created another SQLite database flights.sqlite
that contains both nycflights13::flights
and nycflights13::planes
, the latter of which has details on the characteristics of the planes in the dataset as identified by their tail numbers.
All of the following code will be run in the SQLite command line interface, make sure you’ve created the database and copied both the flights and planes tables into the db.
flights.sqlite
The database can then be opened from the terminal tab using,
As before we should set a couple of configuration options so that our output is readable, we include .timer on
so that we get time our queries.
flights
sqlite> SELECT * FROM flights LIMIT 10;
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
## ---- ----- --- -------- -------------- --------- -------- -------------- --------- ------- ------ ------- ------ ---- -------- -------- ---- ------ ------------
## 2013 1 1 517 515 2.0 830 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400.0 5.0 15.0 1357034400.0
## 2013 1 1 533 529 4.0 850 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416.0 5.0 29.0 1357034400.0
## 2013 1 1 542 540 2.0 923 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089.0 5.0 40.0 1357034400.0
## 2013 1 1 544 545 -1.0 1004 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576.0 5.0 45.0 1357034400.0
## 2013 1 1 554 600 -6.0 812 837 -25.0 DL 461 N668DN LGA ATL 116.0 762.0 6.0 0.0 1357038000.0
## 2013 1 1 554 558 -4.0 740 728 12.0 UA 1696 N39463 EWR ORD 150.0 719.0 5.0 58.0 1357034400.0
## 2013 1 1 555 600 -5.0 913 854 19.0 B6 507 N516JB EWR FLL 158.0 1065.0 6.0 0.0 1357038000.0
## 2013 1 1 557 600 -3.0 709 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229.0 6.0 0.0 1357038000.0
## 2013 1 1 557 600 -3.0 838 846 -8.0 B6 79 N593JB JFK MCO 140.0 944.0 6.0 0.0 1357038000.0
## 2013 1 1 558 600 -2.0 753 745 8.0 AA 301 N3ALAA LGA ORD 138.0 733.0 6.0 0.0 1357038000.0
##
## Run Time: real 0.051 user 0.000258 sys 0.000126
planes
sqlite> SELECT * FROM planes LIMIT 10;
## tailnum year type manufacturer model engines seats speed engine
## ------- ---- ----------------------- ---------------- --------- ------- ----- ----- ---------
## N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 Turbo-fan
## N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
## N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
## N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
## N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 Turbo-fan
## N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
## N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
## N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
## N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
## N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 Turbo-fan
##
## Run Time: real 0.001 user 0.000159 sys 0.000106
Write a query that determines the total number of seats available on all of the planes that flew out of New York in 2013.
Why?
. . .
EXPLAIN QUERY PLAN
Key things to look for:
SCAN
- indicates that a full table scan is occurring
SEARCH
- indicates that only a subset of the table rows are visited
AUTOMATIC COVERING INDEX
- indicates that a temporary index has been created for this query
sqlite> CREATE INDEX flight_tailnum ON flights (tailnum);
## Run Time: real 0.241 user 0.210099 sys 0.027611
sqlite> EXPLAIN QUERY PLAN SELECT sum(seats) FROM flights
LEFT JOIN planes USING (tailnum);
## QUERY PLAN
## |--SCAN flights USING COVERING INDEX flight_tailnum
## `--SEARCH planes USING INDEX plane_tailnum (tailnum=?)
sqlite> EXPLAIN QUERY PLAN SELECT sum(seats) FROM (SELECT tailnum FROM flights)
LEFT JOIN (SELECT tailnum, seats FROM planes) USING (tailnum);
## QUERY PLAN
## |--MATERIALIZE SUBQUERY 2
## | `--SCAN planes
## |--SCAN flights USING COVERING INDEX flight_tailnum
## `--SEARCH SUBQUERY 2 USING AUTOMATIC COVERING INDEX (tailnum=?)
An index can be created on more than one column at a time. This is useful for queries that filter on multiple columns, but note that the order of the columns in the index matters.
!=
alternativesqlite> SELECT origin, count(*) FROM flights
WHERE origin > "EWR" OR origin < "EWR";
## origin count(*)
## ------ --------
## JFK 215941
##
## Run Time: real 0.020 user 0.021148 sys 0.001290
sqlite> EXPLAIN QUERY PLAN SELECT origin, count(*) FROM flights
WHERE origin > "EWR" OR origin < "EWR";
## QUERY PLAN
## `--MULTI-INDEX OR
## |--INDEX 1
## | `--SEARCH flights USING COVERING INDEX flights_orig_dest (origin>?)
## `--INDEX 2
## `--SEARCH flights USING COVERING INDEX flights_orig_dest (origin<?)
sqlite> SELECT carrier, count(*) FROM flights
GROUP BY carrier;
## carrier count(*)
## ------- --------
## 9E 18460
## AA 32729
## AS 714
## B6 54635
## DL 48110
## EV 54173
## F9 685
## FL 3260
## HA 342
## MQ 26397
## OO 32
## UA 58665
## US 20536
## VX 5162
## WN 12275
## YV 601
##
## Run Time: real 0.172 user 0.114274 sys 0.018946
sqlite> SELECT carrier, count(*) FROM flights
GROUP BY carrier;
## carrier count(*)
## ------- --------
## 9E 18460
## AA 32729
## AS 714
## B6 54635
## DL 48110
## EV 54173
## F9 685
## FL 3260
## HA 342
## MQ 26397
## OO 32
## UA 58665
## US 20536
## VX 5162
## WN 12275
## YV 601
##
## Run Time: real 0.023 user 0.022521 sys 0.000411
As mentioned before, creating an index requires additional storage (memory or disk)
Additionally, when adding or updating data - indexes also need to be updated, making these processes slower (read vs. write tradeoffs)
Index order matters - flights (origin, dest)
, flights (dest, origin)
are not the same and similarly are not the same as separate indexes on dest
and origin
.
Sta 523 - Fall 2023