vignettes/linear-regression.Rmd
linear-regression.Rmd
The linear_regression_db()
function can be used to fit
this kind of model inside a database. It uses dplyr
programming to abstract the steps needed produce a model, so that it can
then be translated into SQL statements in the background.
A lightweight SQLite database will be used for this article. Additionally, a sample data set is created.
# Open a database connection
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
RSQLite::initExtension(con)
library(dplyr)
# Copy data to the database
db_flights <- copy_to(con, nycflights13::flights, "flights")
# Create a simple sample
db_sample <- db_flights %>%
filter(!is.na(arr_time)) %>%
head(20000)
The linear_regression_db()
function does not use a
formula. It uses a table, and a named dependent variable. This means
data preparation is needed prior to running the model. The best way to
prepare the data for modeling will be using piped dplyr
operations.
db_sample %>%
select(arr_delay, dep_delay, distance) %>%
linear_regression_db(arr_delay)
#> # A tibble: 1 × 3
#> `(Intercept)` dep_delay distance
#> <dbl> <dbl> <dbl>
#> 1 -0.659 1.00 -0.00337
Adding a categorical a variable to a model requires prior data
transformation The add_dummy_variables()
appends a set of
boolean variables, one for each discrete value. This function creates
one-less discrete variable than the possible values. For example: if the
categorical variable has three possible values, the function will append
two variables. By default, add_dummy_variables()
removes
the original variable.
The reason for this approach is to reduce the number of database operations. Without this step, then a fitting function would have to request all of the unique values every time a new model run, which creates unnecessary processing.
db_sample %>%
select(arr_delay, origin) %>%
add_dummy_variables(origin, values = c("EWR", "JFK", "LGA"))
#> # Source: SQL [?? x 3]
#> # Database: sqlite 3.42.0 []
#> arr_delay origin_JFK origin_LGA
#> <dbl> <dbl> <dbl>
#> 1 11 0 0
#> 2 20 0 1
#> 3 33 1 0
#> 4 -18 1 0
#> 5 -25 0 1
#> 6 12 0 0
#> 7 19 0 0
#> 8 -14 0 1
#> 9 -8 1 0
#> 10 8 0 1
#> # ℹ more rows
In a real world scenario, the possible values are usually not known
at the beginning of the analysis. So it is a good idea to load them into
a vector variable so that it can be used any time that variable is added
to a model. This can be easily done using the pull()
command from dplyr
:
origins <- db_flights %>%
group_by(origin) %>%
summarise() %>%
pull()
origins
#> [1] "EWR" "JFK" "LGA"
The add_dummy_variables()
can be used as part of the
piped code that terminates in the modeling function.
db_sample %>%
select(arr_delay, origin) %>%
add_dummy_variables(origin, values = origins) %>%
linear_regression_db(arr_delay)
#> # A tibble: 1 × 3
#> `(Intercept)` origin_JFK origin_LGA
#> <dbl> <dbl> <dbl>
#> 1 9.62 -10.6 -7.79
One of two arguments is needed to be set when fitting a model with
three or more independent variables. The both relate to the size of the
data set used for the model. So either the sample_size
argument is passed, or auto_count
is set to
TRUE
. When auto_count
is set to
TRUE
, and no sample size is passed, then the function will
do a table count as part of the model fitting. This is done in order to
prevent unnecessary database operations, especially for cases when
multiple models will be tested on top of the same sample data.
db_sample %>%
select(arr_delay, arr_time, dep_delay, dep_time) %>%
linear_regression_db(arr_delay, sample_size = 20000)
#> # A tibble: 1 × 4
#> `(Intercept)` arr_time dep_delay dep_time
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -1.72 -0.000208 1.01 -0.00155
Interactions have to be handled manually prior the modeling step.
db_sample %>%
mutate(distanceXarr_time = distance * arr_time) %>%
select(arr_delay, distanceXarr_time) %>%
linear_regression_db(arr_delay, sample_size = 20000)
#> # A tibble: 1 × 2
#> `(Intercept)` distanceXarr_time
#> <dbl> <dbl>
#> 1 6.77 -0.00000197
A more typical model would also include the two original variables:
db_sample %>%
mutate(distanceXarr_time = distance * arr_time) %>%
select(arr_delay, distance, arr_time, distanceXarr_time) %>%
linear_regression_db(arr_delay, sample_size = 20000)
#> # A tibble: 1 × 4
#> `(Intercept)` distance arr_time distanceXarr_time
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -2.11 0.00269 0.00650 -0.00000435
Fitting a model with regular, categorical and interaction variables will look like this:
remote_model <- db_sample %>%
mutate(distanceXarr_time = distance * arr_time) %>%
select(arr_delay, dep_time, distanceXarr_time, origin) %>%
add_dummy_variables(origin, values = origins) %>%
linear_regression_db(y_var = arr_delay, sample_size = 20000)
remote_model
#> # A tibble: 1 × 5
#> `(Intercept)` dep_time distanceXarr_time origin_JFK origin_LGA
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -3.92 0.0132 -0.00000275 -10.1 -8.05