Hierarchical Clustering in Machine Learning

Original link: mp.weixin.qq.com

Hierarchical Clustering

Clustering groups samples into K clusters, and hierarchical clustering is one such method. It organizes the data into a cluster tree, and the process can be either agglomerative (bottom-up) or divisive (top-down).

Core Idea

The agglomerative approach starts by treating every sample as its own cluster, then repeatedly merges the two closest clusters into a new one, reducing the total number of clusters by one per merge. Merging continues until all clusters have been combined into a single cluster, the number of clusters reaches a preset value, or the distance between the clusters being merged exceeds a threshold. The divisive approach is the opposite: it starts with all samples in one cluster and splits one cluster at a time until a stopping condition is met.
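To make the agglomerative process concrete, here is a minimal sketch using scipy.cluster.hierarchy (scipy is not used in the original article, so treat this as an optional aside): linkage builds the merge tree bottom-up, fcluster cuts it into a chosen number of flat clusters, and dendrogram draws the cluster tree.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# six toy points forming two obvious groups
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

# linkage() records the merge history: each row merges the two closest clusters
Z = linkage(points, method='single')

# cut the tree into 2 flat clusters; labels are 1-based, e.g. [1 1 1 2 2 2]
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# draw the cluster tree (dendrogram)
dendrogram(Z)
plt.show()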

Algorithm Steps

  1. Compute the pairwise distances between the n samples

  2. Treat each of the n samples as its own cluster, giving n initial clusters

  3. Find the two closest clusters and merge them, reducing the cluster count by 1

  4. Repeat the search-and-merge step until the termination condition is met

The distance between two clusters can be their minimum distance, maximum distance, centroid (mean) distance, or average distance.
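As a small illustration (not from the original article), the sketch below computes these inter-cluster distances for two toy clusters:

# Euclidean distance between two 2-D points
def euclidean(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

clusterA = [[1, 1], [1, 2]]
clusterB = [[4, 5], [5, 5]]

# all pairwise distances between points of the two clusters
pairs = [euclidean(p, q) for p in clusterA for q in clusterB]

print(min(pairs))               # minimum (single-linkage) distance
print(max(pairs))               # maximum (complete-linkage) distance
print(sum(pairs) / len(pairs))  # average-linkage distance

# centroid (mean) distance: distance between the two cluster centroids
centroidA = [sum(coord) / len(clusterA) for coord in zip(*clusterA)]
centroidB = [sum(coord) / len(clusterB) for coord in zip(*clusterB)]
print(euclidean(centroidA, centroidB))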

Code Implementation

import pylab as pl
from operator import itemgetter
from collections import OrderedDict, Counter
dataSet = [[1,1],[3,1],[1,4],[2,5],[1,2],[3,2],[2,4],[1,5],[11,12],[14,11],[13,12],[11,16],[17,12],[12,12],[11,11],[14,12],[12,16],[17,11],[28,10],[26,15],[27,13],[28,11],[29,15],[29,10],[26,16],[27,14],[28,12],[29,16],[29,17],[29,13],[26,18],[27,13],[28,11],[29,17]]

# start with every sample as its own cluster (cluster id = sample index)
clusters = [idx for idx in range(len(dataSet))]

# squared Euclidean distance between every pair of samples
distances = {}
for idx1, point1 in enumerate(dataSet):
    for idx2, point2 in enumerate(dataSet):
        if idx1 < idx2:
            distance = pow(abs(point1[0]-point2[0]), 2) + pow(abs(point1[1]-point2[1]), 2)
            distances[str(idx1)+"to"+str(idx2)] = distance

# sort by distance, descending, so popitem() always returns the closest pair
distances = OrderedDict(sorted(distances.items(), key=itemgetter(1), reverse=True))

# keep merging the two closest clusters until only 10% of the original count remains
groupNum = len(clusters)
finalClusterNum = int(groupNum*0.1)
while groupNum > finalClusterNum:
    twopoints, distance = distances.popitem()
    pointA = int(twopoints.split('to')[0])
    pointB = int(twopoints.split('to')[1])
    pointAGroup = clusters[pointA]
    pointBGroup = clusters[pointB]
    if pointAGroup != pointBGroup:
        # merge cluster B into cluster A
        for idx in range(len(clusters)):
            if clusters[idx] == pointBGroup:
                clusters[idx] = pointAGroup
        groupNum -= 1

# keep the three largest clusters; the remaining points are treated as noise
wantGroupNum = 3
finalGroup = Counter(clusters).most_common(wantGroupNum)
finalGroup = [onecount[0] for onecount in finalGroup]
dropPoints = [dataSet[idx] for idx in range(len(dataSet)) if clusters[idx] not in finalGroup]
cluster1 = [dataSet[idx] for idx in range(len(dataSet)) if clusters[idx] == finalGroup[0]]
cluster2 = [dataSet[idx] for idx in range(len(dataSet)) if clusters[idx] == finalGroup[1]]
cluster3 = [dataSet[idx] for idx in range(len(dataSet)) if clusters[idx] == finalGroup[2]]

# plot the three clusters in red, yellow and green, dropped points in black
pl.plot([eachpoint[0] for eachpoint in cluster1], [eachpoint[1] for eachpoint in cluster1], 'or')
pl.plot([eachpoint[0] for eachpoint in cluster2], [eachpoint[1] for eachpoint in cluster2], 'oy')
pl.plot([eachpoint[0] for eachpoint in cluster3], [eachpoint[1] for eachpoint in cluster3], 'og')
pl.plot([eachpoint[0] for eachpoint in dropPoints], [eachpoint[1] for eachpoint in dropPoints], 'ok')
pl.show()

Result:

Using a machine-learning library directly is more convenient:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

dataSet = [[1,1],[3,1],[1,4],[2,5],[1,2],[3,2],[2,4],[1,5],[11,12],[14,11],[13,12],[11,16],[17,12],[12,12],[11,11],[14,12],[12,16],[17,11],[28,10],[26,15],[27,13],[28,11],[29,15],[29,10],[26,16],[27,14],[28,12],[29,16],[29,17],[29,13],[26,18],[27,13],[28,11],[29,17]]
dataSet = np.array(dataSet)

# agglomerative clustering with Ward linkage, stopping at 3 clusters
# (linkage can also be 'complete', 'average' or 'single')
clusterNum = 3
cls = AgglomerativeClustering(linkage='ward', n_clusters=clusterNum).fit(dataSet)

# plot each cluster with its own marker
markers = ['^', 'o', 'x']
for i in range(clusterNum):
    members = cls.labels_ == i
    plt.scatter(dataSet[members, 0], dataSet[members, 1], marker=markers[i])
plt.show()

Result:
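The core-idea section above also mentions stopping once the inter-cluster distance reaches a threshold. As an aside not covered in the original article, recent scikit-learn versions (0.21+) expose this through the distance_threshold parameter. A minimal sketch, reusing the dataSet array from the code above (the threshold value 10 is arbitrary and only for illustration):

from sklearn.cluster import AgglomerativeClustering

# when distance_threshold is given, n_clusters must be None; merging stops
# once the Ward merge distance would exceed the threshold
cls2 = AgglomerativeClustering(n_clusters=None, distance_threshold=10,
                               linkage='ward').fit(dataSet)
print(cls2.n_clusters_)  # number of clusters produced by this threshold
print(cls2.labels_)      # flat cluster label for each sample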


