Setup

First, make sure that you have the following packages installed on your system:

NY City Flights Example

In this example, we will

Create a temporary data base

Add a unique ID column

Here, we add a simple integer row id column. And as we do not know (and don’t care) if there is already any order in the data, we assign randomly permutated ids:

Create a SQLite data base with the flights data set

We store the SQLite data base in the temporary directory and then copy over the flight data set:

path = tempfile("flights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
tbl = copy_to(con, flights)

We now continue to work on the data set stored on the file system, referenced by tbl, and can remove the in-memory copy.

rm(flights) # remove the in-memory data set

Create a DataBackend

We reduce the data set to only keep a subset of features where the regression tree will have no problems:

Additionally, we remove those observations where the arrival delay (arr_delay) has a missing value:

tbl = filter(tbl, !is.na(arr_delay))

And to keep the runtimes reasonable for this toy example, we filter the data to only use every second row:

tbl = filter(tbl, row_id %% 2 == 0)

Finally, we merge the factor levels of feature carrier so that infrequent carriers are replaced by level “other”:

tbl = mutate(tbl, carrier = case_when(
    carrier %in% c("OO", "HA", "YV", "F9", "AS", "FL", "VX", "WN") ~ "other",
    TRUE ~ carrier)
)

This subsetted table is now used to create a mlr3db::DataBackendDplyr.

b = as_data_backend(tbl, primary_key = "row_id")

We can now use the interface of mlr3::DataBackend to query the data:

Note that the mlr3::DataBackend just does not know about any rows or columns we have filtered out with dplyr before creating the mlr3::DataBackend.

Create a regression task and fit regression trees

We create the following mlr3 objects:

We pass all these objects to mlr3::resample() to perform a simple resampling with three iterations. In each iteration, only the required subset of the data is queried from the SQLite data base and passed to rpart::rpart():

Cleanup

Finally, we remove the tbl object and close the connection.