It uses 'tidyeval' and 'dplyr' to run multiple cycles of k-means calculations, expressed as 'dplyr' formulas, until the optimal centers are found.
simple_kmeans_db(
df,
...,
centers = 3,
max_repeats = 100,
initial_kmeans = NULL,
safeguard_file = "kmeans.csv",
verbose = TRUE
)
df: A local or remote data frame.
...: A list of variables to be used in the k-means algorithm.
centers: The number of centers. Defaults to 3.
max_repeats: The maximum number of cycles to run. Defaults to 100.
initial_kmeans: A local data frame with initial centroid values. Defaults to NULL.
safeguard_file: Each cycle will update the file specified in this argument with the current centers. Defaults to 'kmeans.csv'. Pass NULL if no file is desired.
verbose: Indicates whether the progress bar is displayed while the model is being fitted. Defaults to TRUE.
Because each cycle is an independent 'dplyr' operation (or SQL operation, if using a remote source), the latest centroid data frame is saved to the parent environment in case the process needs to be canceled and then restarted at a later point. Passing the saved current_kmeans as the initial_kmeans argument allows the operation to pick up where it left off.
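The restart behavior follows from the cycles being independent: each cycle depends only on the centers produced by the previous one. The base R sketch below illustrates that idea under stated assumptions. It is not the package's implementation; `run_cycles()` and the starting rows are hypothetical names and choices introduced only for this example.

```r
# Hypothetical helper (not part of the package): run up to `max_repeats`
# k-means cycles on a numeric matrix `x`, starting from `initial` centers.
run_cycles <- function(x, initial, max_repeats = 100) {
  centers <- initial
  for (i in seq_len(max_repeats)) {
    # Squared distance from every row to every center (rows x centers)
    d <- sapply(seq_len(nrow(centers)), function(k) {
      rowSums(sweep(x, 2, centers[k, ])^2)
    })
    # Assign each row to its nearest center
    assigned <- max.col(-d, ties.method = "first")
    # Recompute each center as the mean of its assigned rows;
    # a center with no rows keeps its previous position
    updated <- centers
    for (k in seq_len(nrow(centers))) {
      members <- x[assigned == k, , drop = FALSE]
      if (nrow(members) > 0) updated[k, ] <- colMeans(members)
    }
    # Stop once the centers no longer move
    if (isTRUE(all.equal(updated, centers))) break
    centers <- updated
  }
  centers
}

x <- as.matrix(mtcars[, c("mpg", "qsec", "wt")])
start <- x[c(1, 15, 20), ]  # arbitrary starting rows, for illustration only

# Uninterrupted run
full <- run_cycles(x, start, max_repeats = 100)

# Simulate a cancellation after 2 cycles, then resume from the saved centers
saved <- run_cycles(x, start, max_repeats = 2)
resumed <- run_cycles(x, saved, max_repeats = 100)

# The resumed run reaches the same centers as the uninterrupted run
isTRUE(all.equal(resumed, full))
```

With the function documented here, the corresponding restart would be a second call that supplies the saved centroid data frame through the initial_kmeans argument, for example `simple_kmeans_db(mtcars, mpg, qsec, wt, initial_kmeans = current_kmeans)`.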
library(dplyr)
library(modeldb)

mtcars %>%
  simple_kmeans_db(mpg, qsec, wt) %>%
  glimpse()
#> Rows: 32
#> Columns: 15
#> $ k_center <chr> "center_1", "center_1", "center_1", "center_1", "center_1", "…
#> $ k_mpg <dbl> 20.64286, 20.64286, 20.64286, 20.64286, 20.64286, 20.64286, 1…
#> $ k_qsec <dbl> 18.57357, 18.57357, 18.57357, 18.57357, 18.57357, 18.57357, 1…
#> $ k_wt <dbl> 3.072143, 3.072143, 3.072143, 3.072143, 3.072143, 3.072143, 4…
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 1…
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4…
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8…
#> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180,…
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3…
#> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150…
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90…
#> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1…
#> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0…
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3…
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1…