Tools Used in System Design

In this chapter we look at the different tools used in system design. Below is a compilation of the tools, with one or two brief points about each.

List of topics discussed:

1. Containers
2. Schedulers
3. CDN
4. Message Queue
5. Monitoring
6. Load Balancer
7. Logging software
8. Data processing
9. NoSQL DB
10. RDBMS
11. Proxy servers
12. Version control
13. Machine learning
14. Cache server
15. Big Data Database
16. Search and analytics

1. Containers

At its most basic, a container is virtualization software that can be installed on a base operating system. It creates a sandbox in which to run specific software, which makes it easier to deploy the same application across different environments. Below are some of the container tools available in the market.

Docker:

Docker is a software container platform that sits on top of the host operating system. You can deploy a number of applications, each isolated from the others.

AWS ECR:

Amazon Web Services Elastic Container Registry (ECR) is a commercial, managed container registry used to store, manage, and deploy Docker container images. ECR is integrated with Amazon Elastic Container Service (ECS) to simplify the workflow.
Read more at:
https://aws.amazon.com/ecr/

Kubernetes:

Kubernetes is an open-source system for automating the deployment, scaling, and management of multiple containerized applications.
Read More at:
https://kubernetes.io/

CoreOS:

CoreOS is a lightweight Linux distribution designed for running containers. It provides infrastructure for clustered deployments, using fleet for cluster management and etcd for service discovery.
Read More at:
https://coreos.com/

2. Schedulers

Once you have multiple containers running, it becomes difficult to upgrade them or to make the same changes on all the hosts. In those scenarios, schedulers are used to place files and services on a host. They have a default scheduling policy, which determines how services are scheduled without any input from the administrator.
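
As a rough illustration of a default scheduling policy, here is a minimal sketch that places each service on the host currently running the fewest services (a "spread" style policy). The `schedule` function and the host and service names are hypothetical and are not taken from any particular scheduler.

```python
def schedule(services, hosts):
    """Assign each service to the host currently running the fewest services."""
    load = {host: 0 for host in hosts}
    placement = {}
    for service in services:
        # Assumed default policy: pick the least-loaded host.
        # Ties go to the host that was listed first.
        target = min(hosts, key=lambda host: load[host])
        placement[service] = target
        load[target] += 1
    return placement


placement = schedule(["web", "api", "db", "cache"], ["host-a", "host-b"])
```

The administrator only supplies the list of services and hosts; the policy decides where each service lands.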

Docker swarm:

Docker Swarm is a scheduling tool built specifically for Docker containers. It can run in the cloud, where it orchestrates the Docker containers deployed there.

YARN:

YARN stands for Yet Another Resource Negotiator. It is Hadoop’s cluster resource management system. In YARN there are 2 types of hosts:
ResourceManager: This daemon communicates with the client and assigns tasks to the NodeManagers.
NodeManager: This daemon launches and tracks processes on the worker hosts.

3. CDN

CDN stands for Content Delivery Network. A CDN is a network of servers used to speed up delivery of your website's resources. Suppose you are watching YouTube from India: to get the content as fast as possible, it should be served from somewhere near you. So YouTube places content for Indian viewers on servers in or near India and, similarly, places content for USA viewers on servers in the USA.

Below are some of the companies and projects that provide CDN services:

1. Cloudflare

It is one of the most popular CDN service providers. It also provides additional services such as DDoS mitigation and internet security.

https://www.cloudflare.com

2. CoralCDN:

It is an open and free peer-to-peer content distribution network. To use it, append “.nyud.net” to the hostname of an image or video URL, and the request is handled by CoralCDN automatically.

http://www.coralcdn.org/

3. jsDelivr:

It is a free public CDN for open-source projects. It serves files from npm, GitHub, and WordPress plugins.

https://www.jsdelivr.com/

4. Message queues:

A message queue is a component used for inter-process or inter-thread communication. It is asynchronous and commonly follows a publisher/subscriber model.
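
To make the publisher/subscriber idea concrete, here is a minimal in-process sketch using Python's standard `queue` and `threading` modules. The `Broker` class, the `consume` function, and the topic name are hypothetical; real message brokers add persistence, networking, and delivery guarantees that are not shown here.

```python
import queue
import threading


class Broker:
    """Toy broker: every subscriber to a topic gets its own queue."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of subscriber queues

    def subscribe(self, topic):
        inbox = queue.Queue()
        self._subscribers.setdefault(topic, []).append(inbox)
        return inbox

    def publish(self, topic, message):
        # The publisher returns immediately; consumers read at their own pace.
        for inbox in self._subscribers.get(topic, []):
            inbox.put(message)


def consume(inbox, received):
    while True:
        message = inbox.get()
        if message is None:  # sentinel used here to stop the consumer
            break
        received.append(message)


broker = Broker()
inbox = broker.subscribe("orders")
received = []
worker = threading.Thread(target=consume, args=(inbox, received))
worker.start()

broker.publish("orders", "order-1")
broker.publish("orders", "order-2")
broker.publish("orders", None)
worker.join()
```

The publisher and the consumer never call each other directly; the queue decouples them, which is the property the tools below provide across processes and machines.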

Apache ActiveMQ:

It is an open-source MQ written in Java. It implements the full Java Message Service (JMS) specification. It can be used for communication between two distributed processes.

http://activemq.apache.org/

Apache RocketMQ:

RocketMQ is a fast, low-latency distributed messaging and streaming platform. It supports features like:
1. Log hub for streaming
2. Data integration
3. FIFO and strict ordered messaging

https://rocketmq.apache.org/

RabbitMQ

RabbitMQ is built on the Advanced Message Queuing Protocol (AMQP). It is lightweight and easy to deploy in the cloud.

https://www.rabbitmq.com/

5. Monitoring

Monitoring software is used to check the status of multiple servers. These checks are also called health checks, and they identify servers that are down or unreachable.

Zookeeper:

Among other things, ZooKeeper is used to maintain configuration information and to provide naming and group services in distributed applications. It is also used to maintain synchronization among the processes it coordinates.

https://zookeeper.apache.org/

Eureka:

Here there is one Eureka server, with multiple Eureka clients connected to it. Every Eureka client must send heartbeats; the server checks the heartbeats to determine each client's health status.

https://github.com/Netflix/eureka/wiki/Eureka-at-a-glance
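
The heartbeat idea described above can be sketched in a few lines. This is not Eureka's API: the `Registry` class, its method names, and the 30-second timeout are assumptions made for illustration only.

```python
import time


class Registry:
    """Toy health registry: a client is UP if it sent a heartbeat recently."""

    def __init__(self, timeout_seconds=30.0):
        # Arbitrary timeout chosen for this sketch.
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # client id -> time of the last heartbeat

    def heartbeat(self, client_id, now=None):
        self._last_seen[client_id] = time.monotonic() if now is None else now

    def status(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(client_id)
        if last is None:
            return "UNKNOWN"
        return "UP" if now - last <= self.timeout_seconds else "DOWN"


registry = Registry(timeout_seconds=30.0)
registry.heartbeat("service-a", now=0.0)
registry.heartbeat("service-b", now=0.0)
registry.heartbeat("service-a", now=25.0)  # service-b stops sending heartbeats

status_a = registry.status("service-a", now=40.0)
status_b = registry.status("service-b", now=40.0)
```

The explicit `now` argument only exists so the example is deterministic; a real server would use its own clock.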

6. Load balancing:

A load balancer (LB) is a device or service that distributes incoming requests across the servers registered with it. The LB sits between the clients and the servers. It can also help protect against DDoS attacks and improves reliability.
There are 3 types of load balancer:

1. Hardware Based LB
2. Cloud Based LB
3. Software Based LB
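
Whatever the type, the core job is the same: pick a registered server for each incoming request. Below is a minimal sketch of round-robin selection, which is one common policy and only an assumption here, since the tools listed below support several algorithms. The `RoundRobinBalancer` class and the server names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Toy load balancer that hands requests to servers in rotation."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._cycle = itertools.cycle(self._servers)

    def route(self, request):
        # Pick the next server in the rotation for this request.
        server = next(self._cycle)
        return server, request


lb = RoundRobinBalancer(["server-1", "server-2", "server-3"])
targets = [lb.route(f"GET /page/{i}")[0] for i in range(5)]
```

After the third request the rotation wraps around, so no single server receives all the traffic.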

Seesaw:

Seesaw is a Linux-based software load balancer developed in Go. It operates at layer 4 of the OSI model.
https://github.com/google/seesaw

LoadMaster by KEMP:

It is a software-based LB with a free edition; advanced features are paid. It provides layer 4 and layer 7 load balancing.
https://freeloadbalancer.com/

HAProxy

It provides high availability (HA) and TCP/HTTP load balancing. It supports IPv6 and Unix sockets. It is suitable for very high traffic websites.

https://www.haproxy.org/

7. Logging software

Logging software is used to gather logs from multiple microservice applications. Log management covers collecting, indexing, and analyzing both structured and unstructured data.

Logstash

Logstash is a lightweight, open-source data processing engine. It collects data from many sources and transforms it. The data goes through 3 stages:
Input: This is where the data gets collected.
Filters: This stage processes the data by applying filters.
Output: This stage writes the data to a file, Elasticsearch, or another tool.

https://www.elastic.co/products/logstash
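
The three stages can be pictured as a chain of generators. This is a sketch in plain Python, not Logstash configuration; the function names, the log format, and the choice to drop DEBUG lines are all assumptions for illustration.

```python
def read_input(lines):
    """Input stage: turn raw lines into events."""
    for line in lines:
        yield {"message": line.strip()}


def apply_filters(events):
    """Filter stage: parse each event and drop DEBUG noise."""
    for event in events:
        level, _, text = event["message"].partition(" ")
        if level == "DEBUG":
            continue
        yield {"level": level, "text": text}


def write_output(events):
    """Output stage: a list stands in for a file or a search index."""
    return list(events)


raw = ["INFO service started\n", "DEBUG cache warm\n", "ERROR disk full\n"]
stored = write_output(apply_filters(read_input(raw)))
```

Each stage only knows about the events flowing through it, so stages can be swapped or extended independently.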

Graylog

It is an enterprise-level log management system. It has multiple features like:
1. Dashboards
2. Multi-threaded search
3. Fault tolerance

8. Data processing

Apache Spark:

It is a fast, in-memory big data processing engine. It uses RDDs (Resilient Distributed Datasets) to split work into smaller tasks that are distributed across the nodes.
https://spark.apache.org/

Apache Pig:

Pig is a platform for analyzing large data sets. Its scripts are compiled into MapReduce jobs that run over the data.

https://pig.apache.org/
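
For readers new to MapReduce, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using a word count. It does not use Pig or Hadoop; the function names and sample records are hypothetical, and a real cluster would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


records = ["big data tools", "big data", "data"]
counts = reduce_phase(shuffle(map_phase(records)))
```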

Apache Hive

Apache Hive is built on top of Apache Hadoop. It lets you query data held in distributed storage using SQL syntax.

https://hive.apache.org/

9. NoSQL DB

In a NoSQL DB the data is not stored in the fixed tabular structure of a relational database, and no relations are enforced between the stored items. NoSQL databases are used for big data and real-time web applications. Some of the types are:

Key-value storage.
Document storage.
Graph storage.
Tabular storage.
Tuple storage.

MongoDB:

MongoDB is a document-based NoSQL DB. It is highly scalable and stores data in JSON-like documents. It can be scaled horizontally. It is also open source.

https://www.mongodb.com

Apache CouchDB

CouchDB is a document-based NoSQL DB. The data is stored as JSON documents made up of key-value pairs.

http://couchdb.apache.org/

Apache Cassandra

Apache Cassandra is an open-source, distributed, wide-column NoSQL DB. It provides fault tolerance on cloud infrastructure for mission-critical data.

http://cassandra.apache.org/

10. RDBMS

RDBMS stands for Relational Database Management System. Here the data is stored in tables, and those tables are related to one another. SQL is the most common language for accessing the data.

MySQL

It is open-source RDBMS software written in C and C++. It offers good performance and reliability, and is very easy to use.

https://www.mysql.com/

MariaDB

MariaDB is an open-source RDBMS. It is often used as a replacement for MySQL. It supports ACID properties.

SQLite:

SQLite is a small, fast, highly reliable SQL DB. It is written in C, and its source code is in the public domain. The Android OS uses SQLite.

https://www.sqlite.org/index.html
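
The relational idea from the start of this section (tables that reference one another, queried with SQL) can be tried out with Python's built-in `sqlite3` module, which embeds SQLite. The table names, columns, and rows below are made up for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE books ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES authors(id))"
)

conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [("Notes", 1), ("Letters", 1)],
)

# The join follows the relationship between the two tables.
rows = conn.execute(
    "SELECT authors.name, books.title FROM books"
    " JOIN authors ON authors.id = books.author_id"
    " ORDER BY books.title"
).fetchall()
conn.close()
```

The `author_id` column is what ties a row in `books` to a row in `authors`; the same pattern applies in MySQL and MariaDB.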

11. Proxy servers

A proxy server is a software system that sits between the client and the wider internet. It is often used as a caching layer: if the requested resource is already stored on the proxy, it is served from the cache. Because the copy is stored locally, this reduces internet bandwidth usage.

Squid:

Squid is an HTTP-based web proxy. It saves a copy of the documents it fetches and serves the saved copy on repeated requests, which reduces load, access times, and bandwidth consumption.

http://www.squid-cache.org/
https://github.com/squid-cache/squid

Varnish

Varnish is also an HTTP-based web proxy. It is used in front of content-heavy dynamic websites.

https://www.varnish-cache.org
https://github.com/varnishcache/varnish-cache

12. Version control

When you are writing source code, it commonly goes through multiple iterations of changes. Tracking them by hand is very difficult. Hence we use a source code management system.

Git

Git is a DVCS (Distributed Version Control System). Every user has a full copy of the repository on their local machine. Changes are committed locally and then pushed to the shared server when the user is ready. If the server is down, users still have a local copy and can continue their work.
https://git-scm.com/

SVN

Unlike Git, SVN stores all the code in a central repository. The user checks out files, makes changes, and commits them back to the central server. If the central server is down, users cannot commit their work.
https://subversion.apache.org/

13. Machine learning

Apache SystemML

This tool is used for machine learning on big data. It can run on top of Apache Spark.

https://systemml.apache.org/

BigML

BigML is a machine-learning-as-a-service platform. It has an easy-to-use interface.

https://bigml.com/

14. Cache server

A cache server stores web pages or other content locally. When the same content is requested again, it is served from the local store instead of being fetched from the origin.

Memcached

It is a free, open-source, distributed memory caching system. It is used to speed up dynamic web applications.

https://memcached.org/

Redis

Redis is an in-memory data structure store that is often used as a cache. It supports data structures such as strings, hashes, and lists. It also supports LRU (least recently used) cache eviction.
https://redis.io/
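
To show what LRU eviction means, here is a minimal sketch built on Python's `collections.OrderedDict`. It illustrates the policy only and is not how Redis implements eviction internally; the `LRUCache` class and the capacity of 2 are assumptions for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.put("c", 3)  # the cache is full, so "b" is evicted
```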

Varnish cache

Varnish is an HTTP caching system that stores the pages it serves. Its advantages include lower CPU usage on the backend and faster page delivery.

https://varnish-cache.org/intro/

15. Big Data Database

Apache Hadoop

Apache Hadoop is an open-source framework for distributed storage and processing of large data sets. It is used to store big data, and it can be integrated with other Apache tools to make data analytics easier.

https://hadoop.apache.org/

16. Search and analytics:

Elasticsearch

Elasticsearch is a distributed, text-based search engine. It is built on Apache Lucene.
https://www.elastic.co/

SearchBlox

SearchBlox is an enterprise search and analytics engine. Along with search, it also helps in analyzing complex data from multiple sources.

https://www.searchblox.com/