It uses 'tidyeval' and 'dplyr' to run multiple cycles of k-means calculations, expressed as dplyr formulas, until the optimal centers are found.
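To illustrate how a k-means cycle can be written with dplyr verbs, here is a minimal sketch of one assign-and-update step, repeated until the centers stop moving. It is not the package's implementation: the helper `kmeans_cycle()`, the hard-coded `mpg` and `wt` columns, and the choice of the first three rows as starting centers are assumptions made for illustration, and `cross_join()` requires dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical helper (not part of the package): one k-means cycle.
kmeans_cycle <- function(df, centers) {
  assigned <- df %>%
    mutate(row_id = row_number()) %>%
    # Pair every row with every center.
    cross_join(centers) %>%
    # Squared Euclidean distance to each center.
    mutate(dist = (mpg - k_mpg)^2 + (wt - k_wt)^2) %>%
    # Keep only the closest center for each row.
    group_by(row_id) %>%
    slice_min(dist, n = 1, with_ties = FALSE) %>%
    ungroup()

  # New centers are the means of the rows assigned to each center.
  assigned %>%
    group_by(k_center) %>%
    summarise(k_mpg = mean(mpg), k_wt = mean(wt), .groups = "drop")
}

data <- select(mtcars, mpg, wt)

# Assumption: start from the first three rows.
centers <- data.frame(
  k_center = paste0("center_", 1:3),
  k_mpg = data$mpg[1:3],
  k_wt = data$wt[1:3]
)

# Repeat the cycle until the centers no longer change (or 100 cycles pass).
for (i in seq_len(100)) {
  new_centers <- kmeans_cycle(data, centers)
  converged <- isTRUE(all.equal(as.data.frame(new_centers), centers))
  centers <- as.data.frame(new_centers)
  if (converged) break
}

centers
```

Each cycle here is a single dplyr pipeline, which is the property the function relies on when the data lives in a remote source.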
Usage
simple_kmeans_db(
df,
...,
centers = 3,
max_repeats = 100,
initial_kmeans = NULL,
safeguard_file = "kmeans.csv",
verbose = TRUE
)
Arguments
- df
A local or remote data frame
- ...
A list of variables to be used in the kmeans algorithm
- centers
The number of centers. Defaults to 3.
- max_repeats
The maximum number of cycles to run. Defaults to 100.
- initial_kmeans
A local data frame with initial centroid values. Defaults to NULL.
- safeguard_file
Each cycle will update a file specified in this argument with the current centers. Defaults to 'kmeans.csv'. Pass NULL if no file is desired.
- verbose
Indicates whether a progress bar is displayed while the model is being fitted. Defaults to TRUE.
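As a sketch of how the arguments above combine, the call below overrides several defaults at once. It assumes the function is loaded from the 'modeldb' package; the object name `clustered` and the chosen argument values are illustrative only.

```r
library(dplyr)
library(modeldb)  # assumption: the package that provides simple_kmeans_db()

clustered <- mtcars %>%
  simple_kmeans_db(
    mpg, wt,                # variables used for clustering
    centers = 4,            # four clusters instead of the default 3
    max_repeats = 50,       # stop after at most 50 cycles
    safeguard_file = NULL,  # do not write a CSV of the current centers
    verbose = FALSE         # suppress the progress bar
  )

# Number of rows assigned to each center.
clustered %>% count(k_center)
```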
Details
Because each cycle is an independent 'dplyr' operation, or SQL operation if using a remote source,
the latest centroid data frame is saved to the parent environment in case the process needs to be
canceled and then restarted at a later point. Passing the current_kmeans as the initial_kmeans
will allow the operation to pick up where it left off.
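The restart described above might look like the following sketch. It assumes the function comes from the 'modeldb' package and that the saved centroids are visible as `current_kmeans` in the calling environment; the `exists()` guard falls back to a fresh start if they are not. The names `partial`, `start`, and `resumed` are illustrative.

```r
library(dplyr)
library(modeldb)  # assumption: the package that provides simple_kmeans_db()

# A first run capped at two cycles, standing in for a job that was canceled early.
partial <- mtcars %>%
  simple_kmeans_db(mpg, qsec, wt,
                   max_repeats = 2, safeguard_file = NULL, verbose = FALSE)

# Assumption: the latest centers were left behind as `current_kmeans`.
# If the object is not found, NULL makes the function start from scratch.
start <- if (exists("current_kmeans")) current_kmeans else NULL

# Continue from the saved centers instead of starting over.
resumed <- mtcars %>%
  simple_kmeans_db(mpg, qsec, wt,
                   initial_kmeans = start, safeguard_file = NULL, verbose = FALSE)
```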
Examples
library(dplyr)
library(modeldb)

mtcars %>%
  simple_kmeans_db(mpg, qsec, wt) %>%
  glimpse()
#> Rows: 32
#> Columns: 15
#> $ k_center <chr> "center_1", "center_1", "center_1", "center_1", "cen…
#> $ k_mpg <dbl> 20.64286, 20.64286, 20.64286, 20.64286, 20.64286, 20…
#> $ k_qsec <dbl> 18.57357, 18.57357, 18.57357, 18.57357, 18.57357, 18…
#> $ k_wt <dbl> 3.072143, 3.072143, 3.072143, 3.072143, 3.072143, 3.…
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8…
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4…
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146…
#> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, …
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92…
#> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.1…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.…
#> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1…
#> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4…
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1…