Hi,
I have opened a couple of threads about a k-means performance problem
in Spark, and I think I have made a little progress.
Previously I used the simplest call, KMeans.train(rdd, k, maxIterations). It
uses the "k-means||" initialization algorithm, which is supposed to be a
faster version of k-means++ and to give better results in general.
But I observed that if k is very large, the initialization step takes a
long time. From the CPU utilization chart, it looks like only one thread is
working. Please see
https://stackoverflow.com/questions/29326433/cpugapwhendoingkmeanswithspark
I read the paper, http://theory.stanford.edu/~sergei/papers/vldb12kmpar.pdf,
and it points out that the k-means++ initialization algorithm suffers when k
is large. That is why the paper contributes the k-means|| algorithm.
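
To illustrate why plain k-means++ seeding scales poorly with k, here is a
minimal pure-Python sketch of the textbook algorithm. It is not Spark's
code; the function names and the pass counter are my own, added only to
make the point visible: each new center needs another full pass over the
data, so the seeding is sequential in k.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Textbook k-means++ seeding (illustrative, not Spark's implementation).

    Returns (centers, passes), where passes counts full scans of the data.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    passes = 1
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a center; fall back to uniform choice.
            chosen = rng.choice(points)
        else:
            # Sample one point with probability proportional to d2.
            r = rng.random() * total
            acc = 0.0
            chosen = points[-1]  # fallback against floating-point round-off
            for p, w in zip(points, d2):
                acc += w
                if acc > r:
                    chosen = p
                    break
        centers.append(chosen)
        # The next draw depends on this center, so the data must be
        # scanned again before another center can be picked.
        d2 = [min(w, sq_dist(p, chosen)) for p, w in zip(points, d2)]
        passes += 1
    return centers, passes
```

With k=5000 this loop makes 5000 dependent scans, which is the cost the
paper's k-means|| variant is meant to reduce by drawing many candidates
per round.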
If I invoke KMeans.train with the random initialization algorithm, I do
not observe this problem, even with a very large k such as k=5000. This makes
me suspect that k-means|| in Spark is not properly implemented and does
not make use of a parallel implementation.
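
For anyone who wants to repeat the comparison, here is a rough sketch of
how the two initialization modes could be timed side by side. It assumes
the PySpark MLlib API, where KMeans.train takes an initializationMode
keyword (the Scala API offers the same choice through
setInitializationMode). The helper names time_init_modes and
compare_on_spark are hypothetical, and the measured time covers the whole
training call, not just initialization.

```python
import time


def time_init_modes(train, modes=("k-means||", "random")):
    """Call train(mode) once per mode and return {mode: elapsed seconds}.

    `train` is any callable that takes an initialization mode string.
    """
    timings = {}
    for mode in modes:
        start = time.perf_counter()
        train(mode)
        timings[mode] = time.perf_counter() - start
    return timings


def compare_on_spark(rdd, k, max_iterations):
    """Hypothetical helper: needs pyspark and an RDD of feature vectors."""
    from pyspark.mllib.clustering import KMeans

    return time_init_modes(
        lambda mode: KMeans.train(
            rdd, k, maxIterations=max_iterations, initializationMode=mode
        )
    )
```

Keeping max_iterations small should make the difference in initialization
time dominate the result.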
I have also tested my code and data set with Spark 1.3.0, and I still
observe this problem. I quickly checked the PR regarding the KMeans
algorithm change from 1.2.0 to 1.3.0. It seems to contain only code
improvement and polish, without changing or improving the algorithm.
I originally worked in a 64-bit Windows environment, and I also tested in a
64-bit Linux environment. I can provide the code and data set if anyone
wants to reproduce this problem.
I hope a Spark developer can comment on this problem and help identify
whether it is a bug.
Thanks,
Xi Shen
http://about.me/davidshen
