Welcome to the documentation

License: MIT

Site reliability engineering is a software engineering approach to IT operations.

With SRE, 100% reliability is not expected; failure is planned for and accepted.
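One common way to make "planned failure" concrete is an error budget: the amount of downtime an availability target still allows. The sketch below is only an illustration, not something this project prescribes. The function name `error_budget_minutes`, the 30-day window, and the example targets are all assumptions.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target tolerates.

    slo_percent and window_days are illustrative parameters, not values
    taken from this documentation.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    # The budget is the fraction of the window that may be "unreliable".
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = error_budget_minutes(target)
        print(f"{target}% over 30 days leaves {budget:.1f} minutes of budget")
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of acceptable downtime, while a 100% target leaves none, which is why it is not a realistic goal.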

SRE takes the tasks that have historically been done by operations teams, often manually, and instead gives them to engineers or ops teams who use software and automation to solve problems and manage production systems.

SRE helps teams find a balance between releasing new features and making sure that they are reliable for users. SRE helps to improve the reliability of a system today, while also improving it as it grows over time.

A site reliability engineer is a unique role. It requires either a background as a software developer with additional operations experience, or a background as a sysadmin or IT operations engineer who also has software development skills.

Automation is an important part of the site reliability engineer’s role. If they deal with the same problem repeatedly, they automate a solution. This also helps keep operations work at no more than half of their workload.
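The workload balance can be tracked with very little code. This is a minimal sketch, assuming hours are logged per category. The names `ops_share` and `over_ops_cap` and the default cap of 0.5 are hypothetical, chosen to mirror the "half of the workload" guideline above.

```python
def ops_share(ops_hours: float, dev_hours: float) -> float:
    """Return the fraction of total logged time spent on operations work."""
    total = ops_hours + dev_hours
    if total <= 0:
        raise ValueError("at least some hours must be logged")
    return ops_hours / total


def over_ops_cap(ops_hours: float, dev_hours: float, cap: float = 0.5) -> bool:
    """Return True when operations work exceeds the cap (assumed 50%)."""
    return ops_share(ops_hours, dev_hours) > cap


if __name__ == "__main__":
    # Hypothetical week: 25 hours of operations work, 15 hours of development.
    print(over_ops_cap(25, 15))
```

A team that sees this check return `True` week after week has a signal that more of the repeated operations work should be automated.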

Maintaining the balance between operations and development work is a key component of SRE.

Getting Started

Content overview

Bash script

A Bash script is always a useful tool for getting your work done. It is the most basic automation tool you should master.

Python APP

Python is powerful and easy to learn, and it lets you automate complex tasks easily. If you don't know which language to pick, try Python.

Life is short, I use Python.
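As a small taste of task automation in Python, here is a sketch that lists files older than a given age, the kind of chore that often gets done by hand. The function name `find_stale_files`, the directory path, and the seven-day threshold are assumptions for illustration only.

```python
import time
from pathlib import Path


def find_stale_files(directory, max_age_days, now=None):
    """Return a sorted list of files in `directory` older than `max_age_days`.

    `now` is an optional timestamp in seconds, which makes the function
    easy to test. It defaults to the current time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Hypothetical usage: report files in the current directory
    # that have not been modified for a week.
    for path in find_stale_files(".", max_age_days=7):
        print(path)
```

The sketch only reports files. Deleting or archiving them is left out on purpose, so it is safe to run as-is.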

Golang

Golang is much faster than most scripting languages. If you need performance, you need Golang. Concurrency handling is no longer a problem.

Kubernetes

Containerize your system and let K8s help you deploy and update the cluster. A workman must sharpen his tools if he is to do his work well.

Monitoring

Don't let your clients be the ones to tell you that your system is not working well. Use proper monitoring tools to watch and analyse the status of your system, and get yourself alerted before anything actually goes wrong.
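Getting alerted before something breaks usually means comparing a metric against a warning threshold that sits below the failure point. The sketch below is a rough illustration using disk usage. The names `classify` and `disk_usage_percent` and the 80/90 thresholds are assumptions, and a real setup would normally rely on a dedicated monitoring tool rather than a script like this.

```python
import shutil


def classify(value: float, warn: float, crit: float) -> str:
    """Map a metric value to "ok", "warning", or "critical"."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


if __name__ == "__main__":
    used = disk_usage_percent("/")
    # Assumed thresholds: warn at 80% so there is time to act before 90%.
    print(f"disk usage {used:.1f}% is {classify(used, warn=80, crit=90)}")
```

The warning level is the useful part here: it fires while the system still works, which leaves time to react.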

APM

Application Performance Monitoring is always important for large systems. As you expand your business, know your system as well as you know the business.
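At its simplest, application performance monitoring means recording how long operations take and summarizing the results. The following is a minimal sketch of that idea, not a replacement for a real APM product. The names `timed`, `summarize`, and `LATENCIES` are hypothetical, and the in-memory store is an assumption made to keep the example self-contained.

```python
import functools
import time
from collections import defaultdict

# In-memory store of latency samples, keyed by function name (illustrative only).
LATENCIES = defaultdict(list)


def timed(func):
    """Decorator that records the wall-clock duration of every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the sample even when the call raises.
            LATENCIES[func.__name__].append(time.perf_counter() - start)
    return wrapper


def summarize(name):
    """Return the call count, mean, and max latency in seconds for `name`."""
    samples = LATENCIES.get(name, [])
    if not samples:
        return {"count": 0, "mean_s": None, "max_s": None}
    return {
        "count": len(samples),
        "mean_s": sum(samples) / len(samples),
        "max_s": max(samples),
    }


if __name__ == "__main__":
    @timed
    def handle_request():
        time.sleep(0.01)  # Stand-in for real work.

    for _ in range(5):
        handle_request()
    print(summarize("handle_request"))
```

Recording the sample in a `finally` block is a deliberate choice: failed calls are often the slow ones, so dropping them would hide the cases worth investigating.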