It uses 'tidyeval' and 'dplyr' to run multiple cycles of k-means calculations, expressed as dplyr operations, until the optimal centers are found or `max_repeats` is reached.
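To make the idea of a k-means cycle written as 'dplyr' operations more concrete, here is a minimal sketch. It is not the package's internal code: the helper `kmeans_cycle()`, the column names `center`, `c_mpg`, and `c_wt`, and the choice of the first three rows as starting centroids are all assumptions made for illustration. It also assumes dplyr 1.1.0 or later for `cross_join()`.

```r
library(dplyr)

# Hypothetical helper (not part of the package): one k-means cycle for the
# variables mpg and wt, written only with dplyr verbs.
kmeans_cycle <- function(df, centroids) {
  df %>%
    mutate(row_id = row_number()) %>%
    # Pair every row with every centroid
    cross_join(centroids) %>%
    # Squared Euclidean distance from each row to each centroid
    mutate(dist = (mpg - c_mpg)^2 + (wt - c_wt)^2) %>%
    # Keep only the closest centroid for each row
    group_by(row_id) %>%
    slice_min(dist, n = 1, with_ties = FALSE) %>%
    # Recompute each centroid as the mean of the rows assigned to it
    group_by(center) %>%
    summarise(c_mpg = mean(mpg), c_wt = mean(wt), .groups = "drop")
}

points <- as_tibble(mtcars) %>% select(mpg, wt)

# Starting centroids: the first three rows (an arbitrary choice for this sketch)
centroids <- points %>%
  head(3) %>%
  transmute(center = paste0("center_", row_number()), c_mpg = mpg, c_wt = wt)

# Repeat the cycle until the centroids stop changing or 100 cycles have run
for (i in 1:100) {
  updated <- kmeans_cycle(points, centroids)
  converged <- nrow(updated) == nrow(centroids) &&
    isTRUE(all.equal(as.data.frame(updated), as.data.frame(centroids)))
  centroids <- updated
  if (converged) break
}

centroids
```

Each pass through the loop is a self-contained pipeline, which is why a run can be stopped between cycles and resumed from the last set of centroids. The sketch hard-codes two variables and does not handle a cluster that loses all of its rows.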
simple_kmeans_db(df, ..., centers = 3, max_repeats = 100, initial_kmeans = NULL, safeguard_file = "kmeans.csv", verbose = TRUE)
Argument | Description |
---|---|
df | A local or remote data frame |
... | A list of variables to be used in the kmeans algorithm |
centers | The number of centers. Defaults to 3. |
max_repeats | The maximum number of cycles to run. Defaults to 100. |
initial_kmeans | A local dataframe with initial centroid values. Defaults to NULL. |
safeguard_file | Each cycle will update a file specified in this argument with the current centers. Defaults to 'kmeans.csv'. Pass NULL if no file is desired. |
verbose | Indicates whether a progress bar is displayed while the model is being fitted. Defaults to TRUE. |
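
The arguments above can be combined as in the following sketch. It assumes the function is loaded from the 'modeldb' package, and the choice of `mtcars`, the variables `mpg` and `wt`, and the non-default values are illustrative only. The `k_center` column name is taken from the example output at the end of this page.

```r
library(dplyr)
library(modeldb)  # assumption: the package that provides simple_kmeans_db()

clusters <- mtcars %>%
  simple_kmeans_db(
    mpg, wt,                # variables used by the k-means algorithm
    centers = 4,            # ask for four centers instead of the default 3
    max_repeats = 50,       # stop after at most 50 cycles
    safeguard_file = NULL,  # do not write the centers to a CSV file
    verbose = FALSE         # suppress the progress bar
  )

# Number of rows assigned to each center
count(clusters, k_center)
```

Setting `safeguard_file = NULL` avoids leaving a `kmeans.csv` file in the working directory, which is convenient for short local runs.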
Because each cycle is an independent 'dplyr' operation (or SQL operation when using a remote source), the latest centroid data frame is saved to the parent environment as `current_kmeans` in case the process needs to be canceled and restarted later. Passing `current_kmeans` as `initial_kmeans` allows the operation to pick up where it left off.
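
A hedged sketch of that restart flow follows. It mimics an interrupted run by capping `max_repeats` at 2, then resumes from `current_kmeans` if that object is found. The `exists()` guard and the fallback to `NULL` (a fresh start) are assumptions added for safety, as are the use of 'modeldb', `mtcars`, and the variable names `partial`, `start_from`, and `resumed`.

```r
library(dplyr)
library(modeldb)  # assumption: the package that provides simple_kmeans_db()

# Deliberately stop after two cycles to imitate a canceled run
partial <- mtcars %>%
  simple_kmeans_db(mpg, qsec, wt,
                   max_repeats = 2, safeguard_file = NULL, verbose = FALSE)

# Per the description above, the latest centroids should be available as
# `current_kmeans`; fall back to NULL (start from scratch) if they are not.
start_from <- if (exists("current_kmeans")) current_kmeans else NULL

# Resume, or restart, and let the process run to completion
resumed <- mtcars %>%
  simple_kmeans_db(mpg, qsec, wt,
                   initial_kmeans = start_from,
                   safeguard_file = NULL, verbose = FALSE)

glimpse(resumed)
```

A completed run over `mpg`, `qsec`, and `wt` piped into `glimpse()` should print a summary of the same shape as the output that follows.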
#> Rows: 32
#> Columns: 15
#> $ k_center <chr> "center_1", "center_1", "center_1", "center_1", "center_1", "…
#> $ k_mpg    <dbl> 20.64286, 20.64286, 20.64286, 20.64286, 20.64286, 20.64286, 1…
#> $ k_qsec   <dbl> 18.57357, 18.57357, 18.57357, 18.57357, 18.57357, 18.57357, 1…
#> $ k_wt     <dbl> 3.072143, 3.072143, 3.072143, 3.072143, 3.072143, 3.072143, 4…
#> $ mpg      <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 1…
#> $ cyl      <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4…
#> $ disp     <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8…
#> $ hp       <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180,…
#> $ drat     <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3…
#> $ wt       <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150…
#> $ qsec     <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90…
#> $ vs       <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1…
#> $ am       <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0…
#> $ gear     <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3…
#> $ carb     <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1…