Nigel Kersten’s keynote talk was aimed at a pretty broad audience (a bit of Puppet, what’s driving uptake, etc.) but also described some of the new features in components included in the next release (IIRC) of Puppet Enterprise. I was particularly interested to learn about policy-based auto-signing and trusted node data in Puppet 3.4+, external facts in Facter 1.7+, more readable output from Hiera 1.3+, and the news that Puppet Labs will be supporting some of their modules on the Forge.
Peter Leschev from Atlassian described the process of introducing and developing “infrastructure as code” in the Atlassian build engineering team. He described their introduction of a number of tools and measures and the impact on confidence in the infrastructure changes being made. It was interesting to see the journey of adding code reviews, Puppet, Vagrant-based development (with Veewee), behaviour-based testing (with Cucumber), continuous integration (Bamboo and Vagrant), profiling (Puppet’s --evaltrace flag), automated deployment (to staging) and notification (in HipChat). Later on I wished I’d asked if the graphs of confidence in his slides were from measurements, or for illustrative purposes only.
Lindsay Holmwood from Bulletproof described the Flapjack monitoring system – which seems pretty cool – and how you’ll be able to configure it with Puppet (when he releases the Puppet module). The architecture of Flapjack looked pretty interesting and I plan to have a play with it this weekend.
Rene Medellin spoke about NAB’s move to push some of their workloads into “the cloud” (AWS). They used Puppet as part of their SOE machine image building process and in deployment as one of their monitoring and compliance tools. Lots of Jenkins and automated building of AMIs and CloudFormation templates and such.
Aaron Hicks from Landcare Research NZ spoke about the way he uses Puppet in a scientific research environment. Particularly interesting was the use of Puppet to formalise the configuration of the many, many precious snowflake machines used in the various research projects his organisation supports. The idea of supplying Puppet manifests to help in the replication of scientific computing sounds great.
James Dymond and John Painter from Sourced Group described a series of “Puppet in the AWS cloud” architectures they’d developed for clients in their consulting engagements. Most interesting was their fourth (I think) solution, where they implemented a “gateway” between AWS autoscaling notifications and Puppet, allowing the master to sign certificates, delete node reports, etc. as the AWS autoscaling system adds and removes nodes.
Matt Moor from Atlassian described the way they use Puppet to manage their SaaS offering. Each SaaS client has their own VM which, now, is managed using Puppet. This allows them to manage service and version dependencies much more reliably than their previous approach of building massive WAR files using Maven and managing them with hack-y shell scripts.
The last talk was by Chris Barker from Puppet Labs who gave a product demonstration of Puppet Enterprise. I’d already used most of the features demoed but some of the newer stuff – especially the event inspector – looked pretty cool.
Puppet Camp Sydney 2014 was a great event and brought to mind again just how much fun operations work (what little I’ve done) can be. In time, I expect the slides and videos of the presentations will be available from the Puppet Labs web-site on the Previous Puppet Camps page.
Artur is a Senior Software Engineer at Yahoo!7. I think he said he’s on the platforms team? The environment within the team is rather different from many others – it has much more in common with release engineering and system administration than with other roles.
Everything is released and deployed as packages using a suite of tools and formats developed within the Yahoo! empire. Packages include (almost) everything: PHP source code, crontabs, configurations, etc.
Release descriptions (CMR) include:
When he joined the team, the 5 members were responsible for 180 packages (committing to 1-2 dozen packages in an average sprint).
There was a lack of visibility into not only the state of the various packages (deployed versions, build and test status, etc.) but even which packages there are (some were committed to SVN but never made it into the package repository).
Problem with packages lingering without stable releases. Wanted to be able to recreate environments, etc. but dependencies not being promoted to stable can make it a pain in the arse to track down specific versions.
A great deal of manual work to assemble change management requests for releases. Two days of work at the end of each sprint, trawling through documentation, trackers, SVN, etc.
Ten different application clusters with different versions of different packages on each.
Perception was that the team was doing way too much manual work.
Constantly searching for information in disparate sources; repos, code, trackers, wikis, etc.
Ecosystem is too complex.
Too many moving parts & chances to screw things up.
Provide visibility
I don’t want to guess, nor search.
Automate
Do it for me or tell me what to do next.
Data aggregation
Single point of entry for Bugzilla, svn, ci, dist, CMR tool, etc.
Provide metrics
Built it over Christmas period.
Automated job to process the entire SVN repo, discover packages and generate 190 static HTML reports.
Second release using MySQL.
List of 190 packages. Sort by: CI state (broken at top), release state (commits but no version released), package created (but not deployed everywhere yet), up to date.
Provides information including:
Version numbers (svn trunk, newest in package repo, oldest in production)
“Score” (higher is worse) so it can rank things by priority.
Links to various sources of information (related CMRs, SVN, CI, repo)
Rollup
Interrogates various data sources:
Assemble changelogs, etc.
Some packages are based on old CVS repositories, need crazy date-based logic to build a diff.
Dependencies between packages are really annoying; there are lots of them. 10 major applications, 190 packages. Only a few packages are relatively independent.
Provides overview of dependencies:
Metrics to tell:
How are things? Good or bad?
How are things changing? Getting better?
Lag-Score tries to combine a range of factors (tests failing, production versions, etc.) into a single number. Plotted, it shows very little progress on this over 6 months.
Why a custom packaging tool?
It was invented at Yahoo! before there were existing tools like dpkg, rpm, etc. Lots of tools to manage, e.g., 40,000 servers involved in Yahoo! Mail.
Given the tools and scale, it probably won’t be going away.
Release notes: if it’s bullshit, why not kill it completely?
It’s an embedded part of the environment and culture of this team and other teams. Also: comes from global.
CMRs provide a communication channel between teams and sysadmins. It’s a heavy process, and they’re trying to make it more lightweight, but safety is important.
How fast do you go?
About two release windows a week.
Sprints are about 3 weeks, but not religious about it.
SCRUM-ish, but no product owner, etc. so only ish.
Have you got your tool into other teams?
New version is in use by three or four more teams.
Internal presentation, now crawling all the things. Using maintainer information to group stuff into teams.
Are all environments managed in the same way?
Yeah, it’s all controlled using the same tools.
Reproducing production in staging for incident response?
Easy using the role-based server management system.
Configuration management in packages?
Packages declare the configuration options they have.
More: a command to override the value of a configuration parameter declared by a package.
Changes to databases aren’t managed by this tooling; they’re done manually. Sometimes they have to make schema changes backward compatible and run them beforehand, etc.
A lot of this is about James having the shits with the way they do things at Yahoo!7 and on the web in general.
Working in Java, metric shit ton of frameworks. JBoss got deprecated.
Everything you can do with Tomcat is an awful hack.
Found a data-intensive server container. Based on Jersey but simple. Also: focussed on the web. Three-tier architecture.
Want more asynchrony: message queues, etc. Decoupling. Wrote a thing that does this. Similar architecture but more ways of asking for things to be done (cron, message queues, etc.)
I don’t recommend anyone ever write server middleware.
Erlang is Erlang; Elixir is a Ruby-ish language which compiles directly to Erlang bytecode.
Elixir Dynamo is a web framework for Elixir. Scaffolding, etc.
See example code.
Been to the US for the RedHat summit last month.
They’ll be releasing a major new version of Red Hat Satellite (their management thing) building on Puppet, Foreman, Katello, Pulp, Candlepin.
RHEL7 release is delayed. It’ll be based on Fedora 19 and the beta is due in December 2013. The 7.0 release is expected early next year. Replacing MySQL with MariaDB; adding MongoDB, nodejs; upgrading a bunch of programming languages; systemd. Will include client and server support for pNFS – an extension of NFS to be parallel.
Support gets queries about Rails apps, etc. They can ask engineers, but engineers are busy. Support staff should be able to interrogate things themselves.
Building on top of knife and knifeblock (manage knife configurations). Plugin allowing support staff to download application keys (to interact with APIs on their behalf), talk to APIs, generate knifeblock configuration and then help resolve issues.
# List apps.
knife ninefold-internal -l
# Generate knifeblock configuration.
knife ninefold-internal -a 23 -g
# Activate the knifeblock configuration.
knife block dev-NF00000004-23
# Do stuff to help investigate and resolve customer's problem.
knife ...
I’m typing these notes during the sessions, so there may be errors and omissions. Any such problems are my fault and not that of the speakers.
Slides are available on Speaker Deck.
Confirmation bias: play devil’s advocate; US political bookmaking.
Negative views are often biased.
Resolving problems the provisioning teams were seeing with automation.
VMware is hurting us.
Then load balancing.
Then VMware again.
Then EC2.
For 18 months.
Incorrectly recalling (rewriting history).
Pretty good argument that conservative media should go behind paywalls.
You could have avoided bad circumstances but didn’t. Try harder.
E.g.: devops-ing the shit out of the alert that woke you up last night (even if another one wakes you up twice as often).
E.g.: higher conviction rates when the prosecution sums up using hindsight language. The defence is more successful with foresight language.
A little bit of knowledge is a dangerous thing. Judging skill in something (in self and others) requires skill.
Setting an impossible deadline: “it’s just typing.”
Making decisions hard or impossible due to overload of options, knowledge, etc.
Poor performers don’t learn from feedback, because they think they know better.
I can X better than this.
Minimal training improves self-assessment, even in the absence of improved skill.
Non-technical management, lean on engineers.
East Asian societies seem to exhibit an inverted DK effect.
People believe they have above average susceptibility to good attributes.
Aesthetics affect perception of truth. You are more likely to believe them.
How many Fs? People skip the ones in words like “of”.
Induce randomness. Avoid patterns the brain can fall into.
Checklist and formalise.
Make text styling simple. Readers of simple fonts are more likely to answer a question correctly than readers of a cursive font.
We ignore stuff which makes us uncomfortable about ourselves. Organisations have it bad.
Twain: it ain’t what you know…
See also:
Quantify your value to move up the ladder.
Devops:
Developers working together with operations to get things done faster in an automated and repeatable way.
How do you know you’re getting it right? Pager quiet?
“It worked fine in dev; it’s an ops problem now.”
Nine nines is meaningless.
Grinding for a year on application support.
Business doesn’t care about P1s, SLAs, etc. All they care about is money. They could never really prove that P1=£
Every one viewed him as a pain in the arse.
The 4am call about a staging server.
False alarms costing $70k per year. Mean time to innocence.
How many people have considered:
How much have we saved the business?
How much have we cost the business?
Automation
Collaboration
Visibility of the system
Business metrics. P1 is supposed to mean the business is impacted.
Baseline starting position.
Measure progress.
Calculate impact on business. Allows you to promote success instead of problems.
Sell value
Monitoring and visibility tools.
Seeing utilisation, application performance monitoring.
Correlate business metrics.
Time is money. Business people like money.
Infrastructure automation with puppet, chef, etc. How much time did these tools save?
Deployment automation: Jenkins, Capistrano, etc.
Log automation: Logstash, Splunk.
Graphite, Nagios, etc.
What is the value of collaboration?
Evaluate the cost of the tools and automation, etc. vs the savings. That’s your value as a practitioner.
Tell the business how much devops culture has saved them.
Sebastian is from Mi9 and Sam is a consultant from ThoughtWorks. Mi9 is a joint venture between Channel 9 and Microsoft. Run ninemsn.com.au which they’ve been trying to move to cloud-y sorts of things.
Main site moved to AWS, trying to move everything else to the cloud too. Not just AWS, also looking to Azure.
Using tools and techniques new to the business: Puppet, Linux (cost savings on licensing). 250 instances, 70/30 Windows/Linux, equally divided between Singapore and Sydney.
Windows administration tool of choice is Powershell. Have a lot of stuff already written in Powershell, don’t want to replace it.
Common pattern of a Puppet File resource and an Exec resource (with appropriate unless, etc. attributes). The interface between Puppet and the script is blurry.
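A minimal sketch of that pattern (the script paths and resource names here are my own, not from the talk): Puppet delivers a PowerShell script with a File resource, then an Exec runs it, guarded by an unless check so it only fires when the work hasn’t already been done.

# Hypothetical example of the File + Exec pattern on Windows.
file { 'C:/scripts/configure-thing.ps1':
  ensure => file,
  source => 'puppet:///modules/example/configure-thing.ps1',
}

exec { 'configure-thing':
  # Run the script; the unless command exits 0 when the work is
  # already done, so the Exec is skipped on subsequent runs.
  command => 'C:/Windows/System32/WindowsPowerShell/v1.0/powershell.exe -File C:/scripts/configure-thing.ps1',
  unless  => 'C:/Windows/System32/WindowsPowerShell/v1.0/powershell.exe -File C:/scripts/check-thing.ps1',
  require => File['C:/scripts/configure-thing.ps1'],
}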
Puppet Agent on Windows
Restarting the Nagios client on Windows: Puppet Enterprise wasn’t able to restart services on Windows correctly, so a Puppet run turned the whole production infrastructure red.
Package management is crappy on Windows (find, download, run an MSI) vs unix (apt-get, yum, etc.); there’s no consistent place for applications to store data and configuration on Windows; and there’s rarely a single tool which can be used across both platforms (performance counter monitoring vs collectd, curl vs a .Net class).
Ease the pain with Nagios (agents for both platforms), Chocolatey (attempt at package management for Windows), Graphite, Amanda (backups).
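For example, with the community Chocolatey provider installed, a Windows package can be managed like any other Puppet package (a sketch; the package name is arbitrary):

# Hypothetical: a Windows package managed via the community
# chocolatey package provider.
package { 'notepadplusplus':
  ensure   => installed,
  provider => 'chocolatey',
}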
Initial move was a little wild-west; everyone had AWS keys, same account; couldn’t control access to specific resources (accidentally kill production instead of staging).
IAM federation is good, but some services (“beta”) like Beanstalk don’t support IAM federation.
Saw EC2 costs split between compute and network. Think about structuring networking. Shut down all the things that aren’t tagged with “stay on all the time”.
Netflix Edda to inspect and record states of AWS resources. Hopefully be able to record changes that happen, with or without failures in change control.
First uses Puppet to push SSH keys out.
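Presumably something along the lines of Puppet’s built-in ssh_authorized_key type (a sketch; the user and key are invented):

# Hypothetical: pushing a user's public SSH key with the built-in type.
ssh_authorized_key { 'alice@workstation':
  ensure => present,
  user   => 'alice',
  type   => 'ssh-rsa',
  key    => 'AAAAB3NzaC1yc2E...',  # public key material, truncated
}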
Goal to start using Active Directory (already using for OS-level auth).
There can be a mismatch between using a Puppet master and continuous integration.
Code is committed.
Compile.
Tests pass.
It deploys!
Production.
Each of the stages may put the application into different environments – dev, test, staging, production. How does this work when the code is Puppet configuration?
Using Puppet environments to take Puppet code through dev, test and prod. Need to use Puppet 2.5 with changes. Need to be able to manage Puppet as part of the environments.
Solution: the Puppet master for each environment is part of that environment. Makes testing of changes to Puppet itself possible. No more breaking all the things.
Don’t want to double the team for another platform. Want to use existing tools – Puppet – in Azure too.
Automating Puppet master deployment means being able to run a Puppet master within Azure.
Azure will be the 5th platform in use.
Windows loves Puppet, but Puppet (the development process) loves Windows quite a bit less.
What are you using?
Quite an array of platforms:
- News site is .Net; purchased CMS.
- NodeJS.
- Ruby on Rails apps.
- A few purchased Java apps.
- Older sites are classic ASP.
- Newer are .Net 2-4; rolling out 4.5
Puppet focusses on newer side of things (Amazon, .Net 4)
Can Cygwin help with the cross platform issues?
Started with this “paper over the differences” mentality but it just doesn’t work. There are corner cases where Cygwin isn’t like unix and you’ll have to touch real Windows anyway.
Also: it’s essentially a Windows team.
Also: The models aren’t the same: registry, OO controls, etc. You can’t just awk the registry. If you’re trapped in Cygwin, you can’t use apps that don’t know the Cygwin filesystem stuff.
Instead, use tools and processes which work on both platforms, rather than trying to pretend there’s only one platform.
Building higher-level tools which can support the different platforms, both OS and cloudish.
Trust
Processes: build processes that you – and your team – believe in; breaking the processes breaks trust. Don’t be the one who commits to master!
Being excellent isn’t enough; all the people should improve all the things all the time.
It’s hard to regain trust that you’ve broken.
Be visible so that your team – and other teams – can see what you’re doing.
@tomsulston
Really like failure. Once destroyed all the telephones in Glasgow with a single perl script.
Also likes schadenfreude: Zune, Vanilla Coke, Nokia N-Gage, Google Wave, Betamax. All seemed like good ideas at the time.
These were all large failures. They didn’t fail soon enough; failing before your ship leaves port is a really good idea.
We have tools like Jenkins and continuous integration to fail early, before it goes live.
Fail fast, learn the lessons and don’t have massive projects blow up. Failing leads to deep, strong learnings; break cognitive biases. Failure is always an option.
Failcake: when you fail and something breaks, you have to buy the team cake. It makes the failure OK; it’s hard to be angry with a mouthful of cake.
Also: ThoughtWorks Australia is hiring; go talk to one of them if you’re interested.
“Devops Doesn’t Work” but a few years later it’s in CIO Magazine, Gartner are looking into it, etc. Is this jumping the shark?
Job trends for technologies like, e.g., Puppet. Big enterprises which are trying to “buy” devops.
Devops as succeeding together
20 infrastructures running different apps, etc. Standardisation and centralisation, but engineering teams will still want to be able to see their environment and be able to change it. Keeping standardisation but allowing specialisation, versioning for specific configurations.
Possible:
Use Hiera and allow them to see their Hiera values. Would need to update to Puppet 3.0 to make that useful.
Possibly publish versions and such as facts in /etc/facter/facts.d/ and expose the facts to them.
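For example, a manifest could drop a key=value file into the external facts directory (a sketch; the fact name and value are mine, and this assumes Facter 1.7+):

# Hypothetical: publish a deployed version as an external fact so
# application teams can query it (e.g. via `facter app_version`)
# without access to the manifests.
$app_version = '1.2.3'  # in practice this would come from release data

file { '/etc/facter/facts.d/app_versions.txt':
  ensure  => file,
  content => "app_version=${app_version}\n",
}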
Is Puppet doing too much? Where’s the demarcation between system configuration and application configuration? Perhaps the version information, etc. belongs in the application repo rather than the Puppet configuration.
Diverse requirements: rubies (MRI 1.9.1, 1.9.2, 1.9.3, jRuby, etc.), databases (MySQL, PostgreSQL). All on Ubuntu and AWS.
A YAML file per environment (i.e. project) containing overrides with versions and the like.
Package the application and use the native package manager to handle the dependency and version requirements.
Perhaps: pre-baked AMI; cloud-init script to apt-get install the package; configure details like DB credentials in Puppet, etc. Again: may be getting Puppet to do too much.
The whole thing of reusable Puppet modules which are all things to all people is just rubbish.
Another suggestion (from Rio Tinto) of using Hiera with “project” layer for version pinning, etc. (Lots of modules are pre-Hiera.) Put logic into the Hiera tree to avoid conditionals in the manifests: common, $sdlc_env (capture test, stage, etc.), $site (DC, etc.)
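A rough sketch of how that might look (the module and key names here are hypothetical): the Hiera tree does the pinning, so the manifest stays free of conditionals.

# Hypothetical: with a hierarchy like "%{project}", "%{sdlc_env}",
# "%{site}", "common", a project-level YAML file can pin app_version
# while everyone else falls through to the common default.
class profile::app {
  $app_version = hiera('app_version', 'installed')

  package { 'example-app':
    ensure => $app_version,
  }
}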
Take the existing Puppet 3.0 stack and adapt it for Windows. Doing it by overriding a bunch of stuff in Hiera based on $os_family.
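A hedged sketch of that approach (the keys and values are invented): the manifest stays platform-neutral and Hiera supplies the per-platform data.

# Hypothetical: one class for both platforms, with the differences
# pushed into Hiera data keyed on $os_family (e.g. an
# "osfamily/%{osfamily}" level in the hierarchy).
class profile::monitoring_client {
  $package = hiera('monitoring_client::package')  # e.g. 'nrpe' vs 'NSClient++'
  $service = hiera('monitoring_client::service')

  package { $package:
    ensure => installed,
  }

  service { $service:
    ensure  => running,
    require => Package[$package],
  }
}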
To branch or not to branch Puppet modules and such.
I don’t always test but when I do, I do it live.
Avoid branching if we can – no divergence, etc. Some good workflows around, e.g. using the git sha as the environment name.
Maybe use a normal git workflow of dev, stage, prod branches.
Dude from Puppet Labs published a ruby script for synching branches into environments on the Puppet master.
Using buildbot with quiescent VMs to deploy pushed Puppet code and do functional testing. Want to add a noop run against changes and catch errors quicker than deploying to test machines.
I’m typing these notes during the sessions, so there may be errors and omissions. Any such problems are my fault and not that of the speakers.
Thanks to the gold sponsors: anchor, puppetlabs, realestate.com.au
Open Spaces is all unconferency. Un-organised or dis-organised, it’s our call.
There’s a function this evening with a bar tab, etc.
My notes from this talk are pretty sloppy. Sorry.
According to the programme, this is the first devopsdays event to start ahead of schedule.
The talent shortage, if there is one, is unevenly distributed.
Puppet (and similar tools) were attempts to build a competitive advantage: organisations without it would be faced with a critical disadvantage. 2008 presentation slides include pictures of the Gatling gun.
Andrew joined the Puppet project as a developer; never worked in and wasn’t passionate about operations, and worked as a software developer (also not a passion).
Fascinated with the dynamics of high performance organisations and the individuals that comprise them. You often see sports teams of exceptional individuals who can’t play well together.
Reference to karate master being beaten at UFC-2. Mentally and physically unprepared for “combat”.
GM dominated the US car and truck market in the 1960s. Their executives visited Japan when its auto industry (with lean, just-in-time, etc.) was nascent and came away convinced that it couldn’t be real because of the lack of inventory, stockpiles, etc.
Tools like CFEngine, Puppet, Chef, Jenkins, TravisCI, Vagrant, AWS, Docker. Books like Release It!, Continuous Delivery, Web Operations, The Phoenix Project, Dev and Ops. The game has changed.
Devops is many things to many people. Elephants and blind men. Molesting the elephant in the room.
Working with organisations who ask “what should we do?” and then respond “we can’t do that”, or ask “who should we hire?” These people wind up thinking devops doesn’t work and that they can’t hire the right people.
People often say “devops doesn’t work” or “agile doesn’t work” missing the fact that work is done by people, not abstract practices.
Maverick. Book about a guy who ran a company doing everything backwards.
Anecdote about “software is eating (has eaten) the world”.
Netflix.
The real company values are shown by who gets rewarded and promoted and who is let go.
You are either building a software business or you’re losing to someone who is.
Either you’re building a learning organisation or you’re losing to one that is. We need to incentivise learning within our organisations:
7 dimensions:
Continuous learning - create continuous learning opportunities.
Inquiry and feedback
Team learning - collaboration
Empowerment - avoid C&C hierarchies
Embedded systems - capture and share learning within teams and communities. Jargon, etc.
System connection - active effort to connect systems, within and without.
Strategic leadership
See the Dimensions of Organisational Learning Questionnaire.
Stop conflating “learning” with “training”. If you don’t experiment before you build the system, then the system is an experiment.
Learning happens within the process. Continuous integration & deployment work by providing feedback and learning within the process. Do the same thing with learning: continuous learning.
People intrinsically want to learn, be challenged, etc. Introducing some of these practices will result in people picking up or leaving (pushed too far out of their comfort zone).
http://www.devopsdays.org/events/2013-downunder/proposals/Devops%20Dungeons%20and%20Dragons/
Hi. My name is David and I’m a sysadmin. I’ve been on call (rosters) for 10 years.
Scenario One: Johnny’s first week in his first sysadmin job. When the phone rings at 3am, the web site is running slow, so he reboots the servers, causing a complete outage. Seeing a high load on the DB, he reboots the DB server. He’s doing everything wrong; it’s a train wreck. “The site was a cluster fuck but it’s coming back up now.”
Scenario Two: John is an experienced sysadmin. The first thing John does is communicate with the rest of the team: “I’m on it.” Then he looks at the change log (the developers probably broke something). He looks at some graphs and methodically gets a view of the state of the system: 7s page loads instead of 5s. He looks at the DB, sees lots of connections from some servers, and notices it’s caused by an external outage; he disables that bit and logs tickets with the external provider and the developer team to fix the issues.
Johnny hasn’t fixed the problem so he’ll get woken up again in an hour.
Jo’burg has the highest rate of gun violence in the world. Their hospital is world renowned; interns come from all over the world to learn.
We need to practice.
Four stages to learning a new skill:
Unconscious incompetence - I don’t know what I don’t know. 6 days
Conscious incompetence - I know what I don’t know. 6 weeks
Conscious competence - I know it, but it’s hard. 6 months
Unconscious competence - I know it, and don’t have to think about it. 6 years
The purpose of training and practice is to reduce the time between the four stages.
Observing the world and making a mental model. Adults do this by reading, by observing others. Children learn by doing things.
Role-play, drills and games have been important in practice for centuries.
Practice dealing with emergencies: either at 3AM or scheduled.
Run them like a D&D campaign. Put team in a room for a few hours. Appoint a dungeon master and rotate the role regularly.
The DM plans the scenario beforehand, and explains the problem. If you have a robust environment, break production. Monitor and track events during the course of the exercise. Conduct the postmortem.
Pass on knowledge by doing and practice!
Wouldn’t it be interesting to use this to interview people?
Hopefully this exercise will result in a reduction of MTTR.
Think about how you want your teams structured. In D&D, a party of 4 dwarves wouldn’t work very well. Balanced teams are as important as balanced parties.
Can we distil and describe the attributes of team members like we do in roleplaying games?
Specialists have extremely high skills in one area.
Generalists have a wide range of skills but may not be expert in any particular field.
Just like a D&D party, a team need to be balanced and diverse.
Performing tasks should be a function of skill, not of job description.
More about microservices architecture than the traditional gigantic SOAP monster.
Anchor started in 2000 (no Twitter or Facebook; Google didn’t matter). They grew and needed to build systems (tcl and python tools talking to the customer DB, RT, wiki, physical asset tracking, config management, etc.) They wound up with a complicated [set of] systems, circular dependencies, etc. Plethora of interfaces: direct Postgres DB access, RESTful, XML-RPC, etc. No integration testing.
Upgrading RT 3.8 to 4.0 broke almost everything; everyone has learned “don’t touch anything” (except Matt because he’s the boss). Stagnation.
Solution: rebuild with SOA – loosely-coupled RESTful APIs on all data.
Mandated consistent core behaviour for all APIs. Allows you to learn the whole system (rather than each part).
Conformance test suites; they are the documentation/spec.
New architecture is horizontal, with an API service for each functional unit.
Consistent interface to everything, easier to learn. RESTful, JSON, document formatting, common attributes, authentication, etc. Allows a service directory, common library infrastructure.
Talk about it incessantly until everyone is sick of the topic. Nut out all of the issues. Write a spec based on discussions.
Build an API based on it and discover the bits you missed. Iterate.
Build consumers, to help discover problems, etc.
Provide tutorials and examples for everyone to use. Unexpected use cases (vendor import process is broken, use the API instead and things work).
Provide client libraries for talking to your APIs.
Provide a framework for building more, additional APIs. A lot of what the APIs have in common can be implemented once, in the framework.
Provide lots of documentation, especially “getting started”.
Managed to cut over on time, in spite of a few teething problems.
The proof of a transition project is that you don’t go back.
Less division between support staff and developers. People working together, empowerment, etc.
Tools and systems to check and enforce consistency?
Small organisation, so social enforcement is reliable.
However, a lot of the consistency requirements are testable. E.g., common representations, attributes, etc. These sorts of issues are readily testable.
Anything that was too hard?
Haven’t found anything, yet. REST is good for data and state changes and such.
A few situations with many-to-many relationships were tricky, but using consumer-focussed design to guide making these workable (possibly ignoring the underlying craziness).
Layering?
The API services are the single point of truth for specific types of data. Some access the same backends, but focus on different parts.
Limiting?
Built in load limiting and horizontal scalability from the start.
Organisations build systems which reflect their communication structures?
Yes. It is.
Versioning APIs?
One of the first things that was discussed.
Code uses semantic versioning. Responses all include software version information.
Clients can request specific versions.
Rules for deprecation of specific features, etc.
Did you consider available models for the data in your domain?
Yes, but there was nothing out there that felt right. Only needed 5% of OAuth, for example.
There are lots of APIs, almost all of them do things their own way. There aren’t any standards until you get to things like SOAP.
Give someone a fish and they eat for a day; teach someone to fish, etc.
Devops teams seem to self-limit sizes.
Flow and affordances for creativity. Environmental affordances are aspects which promote or enable actions.
Information radiators: dashboards, etc. They give info to the knowledgeable and promote learning amongst others.
Popup classes
Brownbag classes: more formal.
Kata sessions: everyone brings something small (3 minutes) to share.
Dojos: longer, may involve pre- and post-work.
Hackdays: larger still; form ad hoc teams, address problems.
Should share as much information in these processes as possible; enough to make you feel uncomfortable. Prevent the presence of high priests.
Don’t share your financial servers’ root password, but do share that there are two of them, located in X and Y. People can learn from the architecture, etc.
Development background (from NICTA). Operating software at scale in the cloud requires specific engineering.
80% of outages caused by people/process maturity issues. Mitigations often cause or exacerbate large issues.
Log analysis, static configuration analysis, etc.
Treat operations as a set of steps:
Three ideas:
Undo-framework and undo-ability of operations:
Model, track, and simulate operations:
Mine and model existing processes from log data:
Mine a process from existing log files.
Detect deviations early or help error detection. Presumably real-time mining to detect deviations from model, etc.
The Phoenix Project. Company in the book had a bus factor of one: Brent was the single guy who was critical.
Increasing the number of Brents in your organisation can be expensive and slow.
Look for people who are collaborative, passionate and love to share information. But Brent is still Brent, even with these people around. Allow Brent to work on big picture, important work (not fire fighting).
20 people, lucky to do a deploy a week. Now they deploy 5 distinct components a day.
Question: how do we remunerate people based on value they bring, rather than their job title?
Scrapping the towering stack of abstractions that is an app in a guest in a hypervisor on metal. See Erlang on Xen, the golang circuit, etc.
But containerisation, Solaris Zones, etc.
Looks like this is the direction some stuff is going.
Cutting out overheads by passing network layers straight to applications (Intel’s drivers). But talking about optimising for performance is a bit silly when we’re running Ruby and Python.
But VMs give more than abstraction: separation, security. And OS engineers have done a lot in the last few decades.
Over arching question is: what are you optimising for?
Doing continuous delivery requires automation, push button, etc.
ZeroVM, based on NaCl?
How does this fit with devops? Essentially, devops organisations are learning organisations.
From Maverick: measuring everything wasn’t helping, just growing the number of people producing numbers. Went from 12 layers of people to 3 layers.
Dunbar’s number (150) limits the size of social graphs, so they split the company into business units. Build small clusters for products and make all the things for your work. Everyone learns all the machines.
Organisational structures. Often decisions are zero sum. If you treat IT as a cost centre then it always will be.
Incentivise fixing things (flat rate for on call, fix it and you get paid and get to sleep). Can be problematic with established roles, etc.
Technical debt has an organisational parallel. Doing kanban, etc. can help give value and measures to work, etc. Doing one point a week vs the five everyone else does, clearly there’s a problem.
Peter Senge, The Fifth Discipline.
First responder. Trained every week, won trophies but sucked at fires. Training and learning aren’t the same thing; we learn because we want to, not because we’re in a classroom.
Maverick: staff reviewed managers every six months, public.
Mastery: learn one new thing every day.
Training/learning: from Seven Samurai: “if we were using swords, I’d kill you.”
Using Puppet for 3 years; killed the master and are using Fabric to push configs out and apply them as required. Pallet (Clojure), Ansible & Salt (Python), orchestration in Puppet. Wanting to unify the code for system configuration, hardening, etc. and whole stacks (CloudFormation, etc.)
Did a spike of Chef, did a spike of Puppet. Puppet won.
Minimise resource usage when tweaking 10,000 instances by doing things like immutable servers. Don’t tweak 10k instances, just redeploy them. Aminator to make AMIs.
Some of the configuration management tools will have/are having their lunch eaten by tools like CloudFoundry, BOSH, etc. Continuous delivery, configuration management, etc. are all coming together to result in a platform-oriented approach.
What’s the lead time between having an idea to live? All of these technologies – configuration management, platform management, orchestration, etc. – are about automating and minimising this delay.
Better chance to achieve “security” using automation, policy as code, etc. than with traditional pens ‘n’ paper security policies. Standardisation, consistency, monitoring, reporting, etc.
Vagrant for testing Puppet, continuous integration, etc.
Combine chef-client and Nanite over RabbitMQ. Sounds kind of Salt-ish.
Plugging all the things into MCollective and get a message queue by accident.
All the technologies are separate, do we need something that knows the system end-to-end? Hooking monitoring up to orchestration up to configuration management.
Telephone exchanges are feeble, monitoring software wedged the Glasgow phone system by running twice.
Cron job: cron running as root to clean up a directory; root’s $HOME is /. Three or four days.
HPUX: rm -rf followed symlinks; put a symlink to / in home directory.
/ was full, so move something big – /lib, say – onto a separate partition.
New job; we need a UPS for all the servers. Configure network alerts but the switch wasn’t on the UPS.
Why is crontab -r so close to crontab -e? It could at least ask for confirmation.