Lecture 16
Task | Timing (ns) | Timing (μs) |
---|---|---|
L1 cache reference | 0.5 | |
L2 cache reference | 7 | |
Main memory reference | 100 | 0.1 |
Random seek SSD | 150,000 | 150 |
Read 1 MB sequentially from memory | 250,000 | 250 |
Read 1 MB sequentially from SSD | 1,000,000 | 1,000 |
Disk seek | 10,000,000 | 10,000 |
Read 1 MB sequentially from disk | 20,000,000 | 20,000 |
Send packet CA->Netherlands->CA | 150,000,000 | 150,000 |
Lets imagine we have a 10 GB flat data file and that we want to select certain rows based on a particular criteria. This requires a sequential read across the entire data set.
File Location | Performance | Time |
---|---|---|
in memory | \(10~GB \times (250~\mu s / 1~MB)\) | 2.5 seconds |
on disk (SSD) | \(10~GB \times (1~ms / 1~MB)\) | 10 seconds |
on disk (HD) | \(10~GB \times (20~ms / 1~MB)\) | 200 seconds |
This is just for reading sequential data, if we make any modifications (writing) or the data is fragmented things are much worse.
Disk << SSD <<< Memory
Disk <<< SSD << Memory
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can’t fit everything into memory?
Create blocks - group related data (i.e. rows) and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.
Even with blocks, any kind of querying / subsetting of rows requires a linear search, which requires \(\mathcal{O}(N)\) reads.
We can do better if we are careful about how we structure our data, specifically sorting’ some (or all) of the columns.
Sorting is expensive, \(\mathcal{O}(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks - \(\mathcal{O}(\log N)\) ).
In a databases these “sorted” columns are refered to as indexes.
Indexes require additional storage, but usually small enough to be kept in memory even if blocks need to stay on disk.
This is just barely scratching the surface,
Efficiency gains are not just for disk, access is access
In general, trade off between storage and efficiency
Reality is a lot more complicated for everything mentioned so far, lots of very smart people have spent a lot of time thinking about and implementing tools
Different tasks with different requirements require different implementations and have different criteria for optimization
Low level package for interfacing R with Database management systems (DBMS) that provides a common interface to achieve the following functionality:
Provides the implementation necessary to use DBI to interface with an SQLite database.
this package also loads the necessary DBI functions as well (via re-exporting).
Once loaded we can create a connection to our database,
Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
..@ ptr :<externalptr>
..@ dbname : chr ":memory:"
..@ loadable.extensions: logi TRUE
..@ flags : int 70
..@ vfs : chr ""
..@ ref :<environment: 0x126d55040>
..@ bigint : chr "integer64"
..@ extended_types : logi FALSE
employees = tibble(
name = c("Alice","Bob","Carol","Dave","Eve","Frank"),
email = c("alice@company.com", "bob@company.com",
"carol@company.com", "dave@company.com",
"eve@company.com", "frank@comany.com"),
salary = c(52000, 40000, 30000, 33000, 44000, 37000),
dept = c("Accounting", "Accounting","Sales",
"Accounting","Sales","Sales"),
)
Databases queries are transactional (see ACID) and are broken up into 3 steps:
<SQLiteResult>
SQL SELECT * FROM employees
ROWS Fetched: 0 [incomplete]
Changed: 0
There is also dbGetQuery()
which combines all three steps,
name email salary dept
1 Alice alice@company.com 52000 Accounting
2 Bob bob@company.com 40000 Accounting
3 Carol carol@company.com 30000 Sales
4 Dave dave@company.com 33000 Accounting
5 Eve eve@company.com 44000 Sales
6 Frank frank@comany.com 37000 Sales
dbCreateTable()
will create a new table with a schema based on an existing data.frame / tibble, but it does not populate that table with data.
Data can be added to an existing table via dbAppendTable()
.
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
db = DBI::dbConnect(RSQLite::SQLite(), "flights.sqlite")
( flight_tbl = dplyr::copy_to(
db, nycflights13::flights, name = "flights", temporary = FALSE) )
# Source: table<flights> [?? x 19]
# Database: sqlite 3.41.2 [flights.sqlite]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dbl>
All of this data now lives in the database on the filesystem not in memory,
flight_tbl
?List of 2
$ year :List of 2
..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
.. .. ..@ ptr :<externalptr>
.. .. ..@ dbname : chr "flights.sqlite"
.. .. ..@ loadable.extensions: logi TRUE
.. .. ..@ flags : int 70
.. .. ..@ vfs : chr ""
.. .. ..@ ref :<environment: 0x1278aae48>
.. .. ..@ bigint : chr "integer64"
.. .. ..@ extended_types : logi FALSE
..$ disco: NULL
..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
$ month :List of 5
..$ x : 'ident' chr "flights"
..$ vars : chr [1:19] "year" "month" "day" "dep_time" ...
..$ group_vars: chr(0)
..$ order_vars: NULL
..$ frame : NULL
..- attr(*, "class")= chr [1:3] "lazy_base_remote_query" "lazy_base_query" "lazy_query"
- attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
# Source: table<flights> [?? x 19]
# Database: sqlite 3.41.2 [flights.sqlite]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dbl>
# Source: SQL [?? x 3]
# Database: sqlite 3.41.2 [flights.sqlite]
origin dest tailnum
<chr> <chr> <chr>
1 EWR CLT N152UW
2 EWR IAH N535UA
3 JFK MIA N5BSAA
4 JFK SJU N531JB
5 JFK BQN N827JB
6 LGA IAH N15710
7 JFK IAD N825AS
8 EWR TPA N802UA
9 LGA ATL N996DL
10 JFK FLL N627JB
# ℹ more rows
dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
When building a query, we don’t want the entire table, often we want just enough to check if our query is working / makes sense.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
won’t connect and query the database until absolutely necessary (e.g. show output),
and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like.
we can force evaluation via compute()
, collect()
, or collapse()
show_query()
In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL. dbplyr has a function, translate_sql()
, that lets you experiment with how R functions are translated to SQL.
By default dbplyr::translate_sql()
will translate R / dplyr code into ANSI SQL, if we want to see results specific to a certain database we can pass in a connection object,
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! no such function: grepl
There are two packages that implement this in R which take very different approaches,
data(flights, package = "nycflights13")
tidyquery::query(
"SELECT origin, dest, COUNT(*) AS n
FROM flights
WHERE month = 10 AND day = 21
GROUP BY origin, dest"
)
# A tibble: 181 × 3
origin dest n
<chr> <chr> <int>
1 EWR ATL 15
2 EWR AUS 3
3 EWR AVL 1
4 EWR BNA 7
5 EWR BOS 17
6 EWR BTV 3
7 EWR BUF 2
8 EWR BWI 1
9 EWR CHS 4
10 EWR CLE 4
# ℹ 171 more rows
flights |>
tidyquery::query(
"SELECT origin, dest, COUNT(*) AS n
WHERE month = 10 AND day = 21
GROUP BY origin, dest"
) |>
arrange(desc(n))
# A tibble: 181 × 3
origin dest n
<chr> <chr> <int>
1 JFK LAX 32
2 LGA ORD 31
3 LGA ATL 30
4 JFK SFO 24
5 LGA CLT 22
6 EWR ORD 18
7 EWR SFO 18
8 EWR BOS 17
9 LGA MIA 17
10 EWR LAX 16
# ℹ 171 more rows
sqldf::sqldf(
"SELECT origin, dest, COUNT(*) AS n
FROM flights
WHERE month = 10 AND day = 21
GROUP BY origin, dest"
)
origin dest n
1 EWR ATL 15
2 EWR AUS 3
3 EWR AVL 1
4 EWR BNA 7
5 EWR BOS 17
6 EWR BTV 3
7 EWR BUF 2
8 EWR BWI 1
9 EWR CHS 4
10 EWR CLE 4
11 EWR CLT 15
12 EWR CMH 3
13 EWR CVG 9
14 EWR DAY 4
15 EWR DCA 3
16 EWR DEN 8
17 EWR DFW 9
18 EWR DSM 2
19 EWR DTW 10
20 EWR FLL 10
21 EWR GRR 2
22 EWR GSO 4
23 EWR GSP 2
24 EWR HNL 1
25 EWR HOU 3
26 EWR IAD 5
27 EWR IAH 11
28 EWR IND 5
29 EWR JAX 4
30 EWR LAS 6
31 EWR LAX 16
32 EWR MCI 4
33 EWR MCO 13
34 EWR MDW 6
35 EWR MEM 3
36 EWR MHT 3
37 EWR MIA 7
38 EWR MKE 3
39 EWR MSN 1
40 EWR MSP 10
41 EWR MSY 4
42 EWR OKC 1
43 EWR OMA 2
44 EWR ORD 18
45 EWR ORF 1
46 EWR PBI 5
47 EWR PDX 2
48 EWR PHX 7
49 EWR PIT 1
50 EWR PVD 1
51 EWR PWM 1
52 EWR RDU 4
53 EWR RIC 5
54 EWR RSW 3
55 EWR SAN 3
56 EWR SAT 1
57 EWR SAV 2
58 EWR SDF 3
59 EWR SEA 5
60 EWR SFO 18
61 EWR SJU 2
62 EWR SLC 1
63 EWR SNA 3
64 EWR STL 8
65 EWR TPA 6
66 EWR TUL 1
67 EWR TYS 1
68 EWR XNA 1
69 JFK ABQ 1
70 JFK ATL 5
71 JFK AUS 4
72 JFK BNA 2
73 JFK BOS 16
74 JFK BQN 1
75 JFK BTV 4
76 JFK BUF 12
77 JFK BUR 1
78 JFK BWI 3
79 JFK CHS 3
80 JFK CLE 1
81 JFK CLT 8
82 JFK CMH 2
83 JFK CVG 3
84 JFK DCA 9
85 JFK DEN 2
86 JFK DFW 2
87 JFK DTW 3
88 JFK FLL 9
89 JFK HNL 1
90 JFK HOU 2
91 JFK IAD 7
92 JFK IAH 1
93 JFK IND 2
94 JFK JAX 3
95 JFK LAS 12
96 JFK LAX 32
97 JFK LGB 2
98 JFK MCI 1
99 JFK MCO 14
100 JFK MIA 9
101 JFK MSP 3
102 JFK MSY 5
103 JFK OAK 1
104 JFK ORD 7
105 JFK ORF 2
106 JFK PBI 4
107 JFK PDX 2
108 JFK PHL 2
109 JFK PHX 6
110 JFK PIT 3
111 JFK PSE 1
112 JFK PWM 4
113 JFK RDU 9
114 JFK ROC 4
115 JFK RSW 2
116 JFK SAN 5
117 JFK SAT 1
118 JFK SEA 7
119 JFK SFO 24
120 JFK SJC 1
121 JFK SJU 11
122 JFK SLC 6
123 JFK SMF 1
124 JFK SRQ 1
125 JFK SYR 4
126 JFK TPA 8
127 LGA ATL 30
128 LGA BGR 2
129 LGA BHM 1
130 LGA BNA 11
131 LGA BOS 16
132 LGA BTV 1
133 LGA BUF 2
134 LGA CAK 2
135 LGA CHS 3
136 LGA CLE 6
137 LGA CLT 22
138 LGA CMH 7
139 LGA CVG 1
140 LGA DAY 1
141 LGA DCA 16
142 LGA DEN 11
143 LGA DFW 14
144 LGA DSM 1
145 LGA DTW 15
146 LGA FLL 10
147 LGA GSO 4
148 LGA GSP 1
149 LGA HOU 2
150 LGA IAD 6
151 LGA IAH 9
152 LGA ILM 1
153 LGA IND 1
154 LGA JAX 2
155 LGA MCI 1
156 LGA MCO 10
157 LGA MDW 6
158 LGA MEM 3
159 LGA MIA 17
160 LGA MKE 5
161 LGA MSN 2
162 LGA MSP 11
163 LGA MSY 3
164 LGA OMA 1
165 LGA ORD 31
166 LGA ORF 1
167 LGA PBI 6
168 LGA PHL 2
169 LGA PIT 5
170 LGA RDU 11
171 LGA RIC 3
172 LGA ROC 2
173 LGA RSW 2
174 LGA SAV 1
175 LGA SDF 1
176 LGA SRQ 2
177 LGA STL 6
178 LGA SYR 1
179 LGA TPA 6
180 LGA TYS 1
181 LGA XNA 3
sqldf::sqldf(
"SELECT origin, dest, COUNT(*) AS n
FROM flights
WHERE month = 10 AND day = 21
GROUP BY origin, dest"
) |>
as_tibble() |>
arrange(desc(n))
# A tibble: 181 × 3
origin dest n
<chr> <chr> <int>
1 JFK LAX 32
2 LGA ORD 31
3 LGA ATL 30
4 JFK SFO 24
5 LGA CLT 22
6 EWR ORD 18
7 EWR SFO 18
8 EWR BOS 17
9 LGA MIA 17
10 EWR LAX 16
# ℹ 171 more rows
The ability of dplyr to translate from R expression to SQL is an incredibly powerful tool making your data processing workflows portable across a wide variety of data backends.
Some tools and ecosystems that are worth learning about:
Sta 523 - Fall 2023