A Casual Look at Distributed Systems: Basic Concepts

Original link: www.jianshu.com

0x00 Preface

These notes cover part of the first lecture of MIT 6.824, combined with material I gathered from a few books.

"Resource sharing is the main motivation for constructing distributed systems!" (Distributed Systems: Concepts and Design, Chapter 1)

0x01 What is a distributed system

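There are many definitions of a distributed system. Here are two fairly authoritative ones taken from two books, quoted in their original English.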

First:

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.

Second:

A distributed system is a collection of independent computers that appears to its users as a single coherent system.

Key characteristics

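Distributed systems have some distinctive characteristics, and these are also the main issues to consider when designing one. How well we understand them directly determines whether we can see why so many distributed systems have been built, what problem each one is trying to solve, what strengths each design has, and what trade-offs it made.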

  • Concurrency of components
  • Lack of a global clock
  • Independent failure of components

Personally, I feel the third one carries the most weight.

Why use a distributed system?

  • to connect physically separate entities
  • to achieve security via isolation
  • to tolerate faults via replication
  • to scale up throughput via parallel CPUs/mem/disk/net

What makes it hard?

There are plenty of difficulties, of course. Broadly, they come down to these:

  • complex: many concurrent parts
  • must cope with partial failure
  • tricky to realize performance potential

How do these difficulties show up in practice?

0x02 A few topics

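Building on the problems listed above, we look at three topics in turn: consistency, fault tolerance, and performance. Solving one of them often introduces a new problem. For example, to avoid losing data when one server in a cluster dies, we keep multiple replicas, but as soon as we introduce replicas we also introduce the problem of data inconsistency. And to keep the data consistent we have to add more processing logic, which lowers performance.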

So how do you design a perfect distributed system? I don't know either, so let's keep learning...

1. Consistency

Consistency is an issue for both replicated objects and transactions involving related updates to different objects (recall the ACID properties).

Consistency problems arise mainly with replication and with distributed transactions.

Achieving good behavior is hard!

  • "Replica" servers are hard to keep identical.
  • Clients may crash midway through multi-step update.
  • Servers crash at awkward moments, e.g. after executing but before replying.
  • Network may make live servers look dead; risk of "split brain".
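
The "after executing but before replying" case can be sketched in a few lines. This is a minimal single-process simulation, not code from the course: the `Server` class, the `drop_reply` flag, and the request-ID cache are all hypothetical names chosen for illustration. The client cannot tell whether its request ran, so it retries with the same ID, and the server uses that ID to avoid executing the update twice.

```python
import uuid


class Server:
    """A toy counter server whose reply may be lost after it executes a request."""

    def __init__(self):
        self.counter = 0
        self.seen = {}  # request_id -> cached reply, used for deduplication

    def increment(self, request_id, drop_reply=False):
        # If this request was already executed, return the cached reply
        # instead of executing it a second time.
        if request_id in self.seen:
            return self.seen[request_id]
        self.counter += 1
        self.seen[request_id] = self.counter
        if drop_reply:
            # Simulate a failure after executing but before replying.
            raise TimeoutError("reply lost")
        return self.counter


def client_increment(server, fail_first_attempt):
    request_id = str(uuid.uuid4())  # the same ID is reused on the retry
    try:
        return server.increment(request_id, drop_reply=fail_first_attempt)
    except TimeoutError:
        # The client does not know whether the update ran, so it retries.
        return server.increment(request_id)


server = Server()
result = client_increment(server, fail_first_attempt=True)
print(result, server.counter)
```

Without the `seen` cache, the retry would increment the counter a second time even though the client asked for one update.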

Consistency and performance are enemies.

  • Consistency requires communication, e.g. to get latest Put().
  • "Strong consistency" often leads to slow systems.
  • High performance often imposes "weak consistency" on applications.

People have pursued many design points in this spectrum.
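
To make the trade-off concrete, here is a toy sketch of a primary with one backup. The `ReplicatedStore` class and its `strong` flag are assumptions made up for this example, and the "network" is just a Python list. In strong mode every `put` waits for the backup, which costs one replication message on the write path; in weak mode the write returns immediately and a read from the backup can be stale until `sync` runs.

```python
class ReplicatedStore:
    """A primary and one backup, replicated either synchronously or lazily."""

    def __init__(self, strong):
        self.strong = strong
        self.primary = {}
        self.backup = {}
        self.pending = []            # writes not yet applied to the backup
        self.write_path_messages = 0  # replication messages a put had to wait for

    def put(self, key, value):
        self.primary[key] = value
        if self.strong:
            # Strong mode: the write is not done until the backup has it.
            self.backup[key] = value
            self.write_path_messages += 1
        else:
            # Weak mode: acknowledge now, replicate later.
            self.pending.append((key, value))

    def get_from_backup(self, key):
        return self.backup.get(key)

    def sync(self):
        # Apply all deferred writes to the backup.
        for key, value in self.pending:
            self.backup[key] = value
        self.pending.clear()


weak = ReplicatedStore(strong=False)
weak.put("x", 1)
stale = weak.get_from_backup("x")   # backup has not seen the write yet
weak.sync()
fresh = weak.get_from_backup("x")

strong = ReplicatedStore(strong=True)
strong.put("x", 1)
immediate = strong.get_from_backup("x")
print(stale, fresh, immediate, strong.write_path_messages)
```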

2. Fault tolerance

1000s of servers, complex net -> always something broken. We'd like to hide these failures from the application.

What we want:

  • Availability -- app can keep using its data despite failures
  • Durability -- app's data will come back to life when failures are repaired

How: replicated servers.

If one server crashes, client can proceed using the other(s).
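
A rough sketch of what "proceed using the other(s)" might look like from the client side. The `Replica` class, the `ReplicaDown` exception, and `get_with_failover` are hypothetical names, and the sketch assumes the replicas already hold identical data, which is the hard part discussed under consistency.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot be reached."""


class Replica:
    def __init__(self, name, data, alive=True):
        self.name = name
        self.data = dict(data)
        self.alive = alive

    def get(self, key):
        if not self.alive:
            raise ReplicaDown(self.name)
        return self.data[key]


def get_with_failover(replicas, key):
    """Try each replica in order and return the first successful answer."""
    for replica in replicas:
        try:
            return replica.get(key), replica.name
        except ReplicaDown:
            continue  # this replica is down, try the next one
    raise ReplicaDown("all replicas are down")


data = {"k": "v"}
replicas = [
    Replica("s1", data, alive=False),  # crashed server
    Replica("s2", data),
    Replica("s3", data),
]
value, served_by = get_with_failover(replicas, "k")
print(value, served_by)
```

The application sees only the value; the failure of `s1` is hidden behind the retry loop, which is the availability property described above.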

3. Performance

What we want: scalable throughput.

Nx servers -> Nx total throughput via parallel CPU, disk, net. So handling more load only requires buying more computers.

But scaling gets harder as N grows. Why?

  • Load imbalance, stragglers (some nodes are much slower than others).
  • Non-parallelizable code: initialization, interaction.
  • Bottlenecks from shared resources, e.g. network.
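
Two of these effects are easy to put numbers on. The sketch below uses Amdahl's law, a standard model that the lecture notes do not name, to show how a non-parallelizable fraction caps the speedup, and a simple `max` to show how one straggler sets the finish time of a whole parallel job. The 5% serial fraction and the task times are made-up inputs.

```python
def amdahl_speedup(serial_fraction, n_servers):
    """Speedup on n_servers when serial_fraction of the work cannot be parallelized."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_servers)


def job_time(per_node_times):
    """A parallel job finishes only when its slowest node finishes."""
    return max(per_node_times)


# Speedup with 5% serial work as the cluster grows.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.05, n), 2))

# Ten nodes with equal work versus nine fast nodes and one straggler.
balanced = job_time([1.0] * 10)
with_straggler = job_time([1.0] * 9 + [5.0])
print(balanced, with_straggler)
```

With 5% serial work the speedup can never exceed 1 / 0.05 = 20x, no matter how many servers are added.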

0xFF Summary

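This post lists some basic concepts. I kept the course notes in their original English: they are easy enough to follow as written, and every attempt at translating them read awkwardly.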

Author: dantezhao | Jianshu | CSDN | GitHub

Personal homepage: dantezhao.com
This article may be reposted, provided the original source and author are credited with a hyperlink.