Working Around the Lack of HA in Nagios Core Using an Inert Alert Record

One of the problems we have with Nagios Core is ensuring that there is a mechanism our backup Nagios server can use to determine that it needs to start publishing records to the alert database when the primary host is no longer around, or no longer able to produce records into the alert database.

Most thinking about HA implements a heartbeat between the primary/backup hosts, and this helps the backup determine that the primary is no longer available. For a hosted service like a web server or a VRRP default gateway, this makes sense. But it makes a lot less sense for a service host whose job is to examine other devices and hosts.

Rather than implementing a heartbeat between the two servers, we could simply implement a rally point in the alerting database that helps the backup host determine whether the primary is healthy. Let's call this an Inert Alert Record since the purpose of the record will not be to alert anyone... it's just a record.

The primary host would need to implement a flow to publish to this well-known inert alert record every minute. The backup host could check that inert alert record and, in the situations where it has determined that the primary host has not updated the record for N or more intervals, it would start upserting records in the alert database. When the primary comes back online, the alert record starts reflecting more recent upserts from the primary host and the backup host stops upserting records.
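
To make that concrete, here's a sketch of the backup host's staleness check in Ruby (the interval, the threshold, and the idea of reading a last-update timestamp off the inert record are illustrative; the actual lookup against your alert database is not shown):

require 'time'

INTERVAL_SECONDS = 60   # the primary upserts the inert record once per interval
N_MISSED = 3            # take over after N missed intervals

# last_update is the timestamp of the most recent upsert to the inert
# alert record, as read from the alert database.
def primary_stale?(last_update, now: Time.now)
  last_update.nil? || (now - last_update) >= N_MISSED * INTERVAL_SECONDS
end

primary_stale?(Time.now - 300)  #=> true: backup starts upserting records
primary_stale?(Time.now - 30)   #=> false: primary is healthy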

What's nice about using a pattern like this is that you don't need a lot of custom software or communication flows... instead we exploit the system of record we are concerned with: the alerting database. We also avoid issues where both hosts are able to reach the alert database but unable to reach one another... no split brain with both hosts clobbering one another's changes.

This allows us to handle the following types of cases:

  • primary host is down while backup host is not
  • primary host is unable to access the alert database while the backup host still has access

Downsides include:

  • the backup host has to monitor the inert record every minute
  • the backup host has to perform redundant checking against all monitored hosts and devices
  • does not solve for a situation where the primary host is able to reach the alerting database but unable to reach the hosts/devices it is responsible for monitoring

Playing with The KeePassXC TOTP Engine

KeePassXC has some built-in two-factor TOTP (Google Authenticator app style) support. This can be handy when you're pretty much only going to log in to a page from a specific laptop.

I spent some time messing around with it this morning. The non-intuitive part is extracting the key. Most sites don't volunteer it, but many offer a manual setup option that displays the key. Alternatively, the key can be pulled out of the QR code using a QR code scanning app.
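
For what it's worth, the QR code itself usually just encodes an otpauth:// URI, and the value of the secret parameter is the key that KeePassXC wants. A made-up example:

otpauth://totp/Example:alice@example.com?secret=JBSWY3DPEHPK3PXP&issuer=Example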

Depending on how you do this, moving to KeePassXC may make things less secure. Two-factor is generally regarded to be "something you have and something you know". But if you use a password manager (and KeePassXC is a password manager) and you store both the password and the TOTP secret in the same database, anyone who pops your database gets both factors.

The guidance is to use two databases. Of course, this makes it "something you have, and something else you have", which isn't exactly the same thing... but for anyone using a password manager, that's been the case for a while now, hasn't it?

An Apple Alternative

If your iCloud Universal Clipboard is in working order, you could just copy the TOTP code on your phone and paste it on your computer.

Spotify Local Audio Files Problem: No devices are currently available for syncing on iOS 14

I found out last night that you can use your desktop instance of Spotify to share your own MP3s to your mobile device. The setup is a bit technical and not for the impatient.

Here's the problem I ran into and how I fixed it. It worked on an old iOS 12 device right away, but I wanted the files on my main phone, which is running iOS 14.

On my desktop I had already:

  • enabled Local Audio Files in Spotify settings

    • added in the folders that have the MP3s I wanted to import

  • added those songs to a playlist

On my phone:

  • enabled Local Audio Files in Spotify settings

  • chose to download the playlist that had the local MP3s

On my iOS 14 device I was seeing this error:

[Screenshot: "No devices are currently available for syncing"]

The problem was that I hadn't enabled Local Network access for the Spotify app on my iPhone. Make sure that it is enabled:

[Screenshot: iOS Settings showing Local Network enabled for Spotify]

And once things are working you should see that One device is currently available for syncing:

[Screenshot: "One device is currently available for syncing"]

After that, the downloads proceed without further prompting.

Opinions and Observations about Systems and Software - State of Franco 2018

I started working at Salesforce around this time in 2014, so pretty soon I'll have spent about even time on two different teams over four years. It went very quickly.

My first couple years were heavy in Ruby using object-oriented paradigms and though I really enjoyed using Ruby and the many creature comforts that came with it, it also had a lot of drawbacks as a language. Using Ruby meant that you were choosing ease over performance. Attempts at using concurrency were pretty frustrating and complicated.

Around two years ago, I made my new-team decision primarily based on two opportunities: the opportunity to learn functional programming practices and Clojure... and the chance to work with the guy I'd actually sat next to for most of the previous couple of years. Neither of these was a mistake.

During my first 2 years at Salesforce, I came to these conclusions:

  • Separate concerns when using structured data: Data in storage/transport should be as flat as possible. Deeply nested data creates tight bindings to structure where they shouldn't exist. A list of hashmaps is about as deep as it should get. Restructure data into an index when it needs to be accessed quickly (see the sketch after this list).
  • All software is trade-offs: Rapid prototyping vs. maturity/scalability... Simplicity vs. code re-use... "Get it working and don't over-generalize" isn't a bad principle, but it assumes that the software will always be under active development and that you will eventually get around to refactoring the code to allow it to grow into its next phase.
  • Nearly all of my achievements are marked with pride and some amount of horror. Things that seem like minor decisions end up having long term implications. Things that were expected to solve a problem in the short term continue their service longer than expected.
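
As a tiny Ruby illustration of that first point (the data here is made up):

# flat data: a list of hashmaps, fine for storage/transport
routes = [
  { prefix: "172.16.1.1/32", next_hop: "192.168.1.1" },
  { prefix: "172.16.2.2/32", next_hop: "192.168.1.2" }
]

# restructure into an index only at the point where fast lookup is needed
by_prefix = routes.each_with_object({}) { |r, idx| idx[r[:prefix]] = r }
by_prefix["172.16.1.1/32"]  #=> { prefix: "172.16.1.1/32", next_hop: "192.168.1.1" }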

I also had deep concerns about object-oriented code and simplicity, but I couldn't have said much about it at the time. Now, after a couple of years of Clojure and watching Rich Hickey videos, I think I can articulate it better. In a nutshell, it's really hard to use objects without getting wrapped up in how they work internally. This is one of the reasons I liked pry, which lets you interrogate any object to get at its methods and documentation in a REPL.

After letting myself immerse in Clojure land for a couple of years, I have a preference for functions that do data manipulation over objects. (I still haven't gotten fully used to namespaced keys, but that is coming.)

Even when I write in Python these days I tend not to use classes, preferring functions such as map and filter along with simple data collections like lists and dicts (though I much prefer Clojure's immutable-value model, which guarantees that no data you have a reference to will change underneath you unless you've made that explicit in the form of an atom).

Controversially, I think that:

  • Object-oriented ought not to be the default model; it needs to be justified against the expense it imposes: you have to be aware of an object's internals, and given the choice between being handed data and being handed objects, I'd rather be handed data. We do still need solutions for managing state and for code reuse.
  • Implicitness and indirection also need to be justified. You need to know what you're getting in exchange and it needs to be worth it. The sacrifices are often in readability. How much shuttling do you have to do between the top of your file/function/class and where you are? It all has a cost.

These are more weakly-held observations:

  • Any system that includes a repo with fragile tests or slow deploy cycles will involve people trying to circumvent that repo if these problems are not addressed quickly.
  • Some people believe data shouldn't live in repos, but I do... especially if it's seed data for some infrastructure-as-code project. (Most databases don't have a good notion of history or time; repos do.)
  • Minimize your deploy surface.

I think it's a good idea to look at facts and form opinions about what works and why. It's good to find out from others who are more believable than you what works for them and why.

I probably did a better job of talking about the what than the why here but my coffee is still kicking in this morning and I'm not really planning to edit this blog post. Which brings me to a final observation:

  • Something is better than nothing.

JustWorks: Disabling AppNap to Prevent MacOS/OSX from Nerfing Terminal Programs

Welcome to another edition of #JustWorks, a series of blog posts on all of the customizations you need to make to MacOS to get around Apple's one-size-fits-all mentality.

If you've ever fired off a long-running program in Terminal and locked your screen to go get a coffee, hoping the job would be done when you got back an hour later, you may have noticed that things take longer to run when your screen is locked or Terminal is in the background.

MacOS did this for you! It did it to save power! And for whatever reason, it doesn't seem to care that you're plugged into the mains.

[Image: itjustworks.png]

Here's how to disable something called AppNap for Terminal. You're going to need to open a Terminal and paste this bit:

defaults write com.apple.Terminal NSAppSleepDisabled -bool YES

Then close all of your Terminal windows and restart the app. You can verify the state of AppNap for Terminal on the Energy panel of Activity Monitor.
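
You can also check the setting from the command line (this should print 1 after the write above), and delete the key later if you want the default behavior back:

defaults read com.apple.Terminal NSAppSleepDisabled

defaults delete com.apple.Terminal NSAppSleepDisabled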

[Screenshot: Activity Monitor, Energy panel, App Nap column for Terminal]

Morning Reading Notes: Wednesday 2017-08-09

  • Microsoft dumps notorious Chinese secure certificate vendor | ZDNet - I'm just learning about TLS certificates at this stage for internal 2-way authentication purposes at work.  It's neat to see how much work goes into maintaining trust for Certificate Authority services.
  • The Guy Who Invented Those Annoying Password Rules Now Regrets Wasting Your Time - the publisher of NIST Special Publication 800-63 Appendix A comes out against our usual password practices as ineffective. I'm sure it'll take the rest of the world a while to stop following that bad advice.
    • Hope for the future: "...the latest set of NIST guidelines recommends that people create long passphrases rather than gobbledygook words like the ones Bill thought were secure."

And for those following the controversy over engineer Damore calling Google's diversity initiatives into question:

Clojure: Modifying a List of Vectors based on Previous Iteration Value

(EDIT: Looks like I muffed the title of this one but I'm going to leave it as is. Should read "Vector of Maps".)

Here's an interesting problem I've been pondering this morning. What is a good functional approach to solving a problem in which you have to iterate through a vector of maps and copy a value from a previous item if the value is nil in the current one?

Problem Statement

This is something that came up while trying to parse output from Cisco Nexus show ip bgp regex....

*|e172.16.1.1/32      192.168.1.1                                    0 65535 i
*>e                   192.168.255.255                                0 65535 i

And this parses into a vector of maps that looks like the value of c here:

(def c [{:next-hop "192.168.1.1", :prefix "172.16.1.1/32", :route-status "*|e"}
        {:next-hop "192.168.255.255", :prefix nil, :route-status "*>e"}])

The goal of this code is to carry over the value of the last-seen prefix to all subsequent items that have nil prefixes. Given the nature of the output, the last-seen prefix value may not come from the item immediately before.

The return of our code is expected to look like this:

[{:next-hop "192.168.1.1", :prefix "172.16.1.1/32", :route-status "*|e"}
 {:next-hop "192.168.255.255", :prefix "172.16.1.1/32", :route-status "*>e"}]

Approaches

My initial approach was to initialize an atom in a let-binding for the carry-over value and to use map.

; map with atom
(let [prev (atom nil)]
  (map (fn [r]
         (if (and @prev (not (:prefix r)))
           (assoc r :prefix @prev)
           (do (reset! prev (:prefix r))
               r)))
       c))

Carrying over a value in a mutable atom isn't a very functional way of dealing with the problem so here are a couple more approaches I came up with.

The next approach uses reduce in which each iteration returns a compound accumulator as [<previous value> <accumulating vector>]:

; reduce with compound accumulator
(second
  (reduce
    (fn [[prev acc] v]
      (if (and prev (not (:prefix v)))
        [prev (conj acc (assoc v :prefix prev))]
        [(:prefix v) (conj acc v)]))
    [nil []]
    c))

What I like about this approach is that it's very clear about what's going into the next iteration. There isn't any mutable state. What I don't like is having to deal with the compound accumulator and having to unpack the result of the reduce using second.

My final approach uses loop/recur:

; loop/recur
(loop [prev nil
       acc []
       coll c]
  (if-let [v (first coll)]
    (if (and prev (not (:prefix v)))
      (recur prev (conj acc (assoc v :prefix prev)) (rest coll))
      (recur (:prefix v) (conj acc v) (rest coll)))
    acc))

When I first looked at this I thought it was a bit awkward but it's growing on me. The loop's base case is when there are no more items in coll - return the accumulated result. And the recursing code will either update prev or append a modified value.

One might think that this approach would suffer from limits on stack depth, but recur is not a normal function call: it jumps back to the loop head without consuming a new stack frame, so stack depth is not a problem.

Github Gist

https://gist.github.com/francisluong/967b5528e4cd64e2d6a34553c677f5cd

OSX: Lock Screen Shortcuts

I'm not a fan of the hot corner method of locking screen so here are 2 other ways to achieve screen lockage on MacOS / OSX:

Keychain Access: Toolbar Icon So You Can Click To Lock Screen

  • CMD-space: Launch "Keychain Access"
  • -> Preferences
  • CHECK: "Show keychain status in menu bar"

Then you can lock screen by clicking the padlock on the Menu Bar and selecting "Lock Screen":

 

Automator and Keyboard Shortcut

Automator Service: Lock Screen via Applescript

First we have to bind an AppleScript to a service that starts the screensaver:

  • CMD-space: Launch "Automator"
  • File->New
  • Service

 

  • Change the "Service receives" dropdown to "no input"
  • Use the search bar to search for "Run AppleScript"

 

  • Double-click "Run AppleScript" to add it to the service.

  • Copy/paste the snippet below into the AppleScript box, replacing any existing text.
  • Save as: "Lock Screen"
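
Any AppleScript that starts the screensaver will do the job here, since macOS locks the screen when the screensaver kicks in (assuming you've checked "Require password after sleep or screen saver begins" in Security & Privacy). A minimal one:

tell application "System Events" to start current screen saver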

 

Keyboard Shortcut

Next we bind a key to the new service.

  • CMD-space: Launch "keyboard" (or System Preferences -> Keyboard)
  • Shortcuts (top bar) -> Services (left pane) -> Lock Screen (right pane - scroll to find it)
[Screenshot: Keyboard preferences, navigating through to the "Lock Screen" service]

  • Add Shortcut (I added CTRL-CMD-K in this example)
[Screenshot: keyboard-final]

Ruby Asynchronous HTTP and EM::Iterator

I recently had a need to issue a number of long-running HTTP GETs that were taking about 20 minutes to complete when performed serially, one after the other. In spite of the concurrency limits imposed by the Global Interpreter Lock, Ruby is well-equipped to handle long-running I/O.

Using the methods documented below, I was able to squash that 20 minutes down to less than 5 minutes. I probably would have been able to do better if the server had been able to handle the full load of all requests concurrently.

Part of the reason I am writing this is the additional amount of tire-kicking I needed to do to get this working, even with the documentation that comes with em-http-request. (Thank you for posting your lovely gem... your documentation is close but not quite there for me.)

Simple Example - Single Request

Because I prefer a number of examples with increasing complexity when I am tackling a new library, that's what you'll get here. This first example implements only a single HTTP request, but it establishes that we have figured out enough to make a single GET work.

It produces output that looks like this:

$ ruby bin/em-http/em-http-000.rb 
D, [2016-07-24T17:12:39.386698 #42654] DEBUG -- : [] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [SUBMITTED] [runtime=0.197928]
D, [2016-07-24T17:12:42.238997 #42654] DEBUG -- : [] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [CALLBACK 200] [runtime=3.050257]
{"SERVER"=>"nginx", "DATE"=>"Sun, 24 Jul 2016 21:12:39 GMT", "CONTENT_TYPE"=>"application/zip", "CONTENT_LENGTH"=>"5242880", "LAST_MODIFIED"=>"Mon, 02 Jun 2008 15:30:42 GMT", "CONNECTION"=>"close", "ETAG"=>"\"48441222-500000\"", "ACCESS_CONTROL_ALLOW_ORIGIN"=>"*", "ACCEPT_RANGES"=>"bytes"}

So far, so good. We download from a URL and it is successful with a status code of 200.

Next Example: Crude Async Concurrency

The next example ensures that we have a solid enough understanding to deal with two concurrent requests. One thing we have to handle is ensuring that EM.stop is only called after all of the jobs have completed. So in this example, we add a request queue collection and a method call to EM.stop when all jobs are done.

It's getting a bit more complex but still quite grokkable, and we can see that both requests get submitted at the same time, while the larger file takes longer to download.

$ ruby bin/em-http/em-http-010-async.rb 
D, [2016-07-24T16:47:21.628277 #42542] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [SUBMITTED] [runtime=0.18803]
D, [2016-07-24T16:47:21.631735 #42542] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [SUBMITTED] [runtime=0.191508]
D, [2016-07-24T16:47:26.295605 #42542] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [CALLBACK/ERRBACK 200] [runtime=4.855387]
D, [2016-07-24T16:47:26.295678 #42542] DEBUG -- : [stop_when_all_finished] [states=[:finished, :body]]
D, [2016-07-24T16:47:30.669914 #42542] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [CALLBACK/ERRBACK 200] [runtime=9.229693]
D, [2016-07-24T16:47:30.669999 #42542] DEBUG -- : [stop_when_all_finished] [states=[:finished, :finished]]

Final Example: Async Concurrency with Concurrency Limits

If we have about 20 requests, we probably don't want them all going to a server and blowing it up. EventMachine provides an elegant mechanism to apply a concurrency constraint so that we can limit the number of active requests. In this final example, I issue six requests and I apply a concurrency limit of 2.

The code itself doesn't look very different. But you can watch the log and see that two jobs are submitted initially and then subsequent jobs are added only as a job completes.

$ ruby bin/em-http/em-http-020-async-with-em-iterator.rb 
D, [2016-07-24T16:55:37.620212 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [SUBMITTED] [runtime=0.174998]
D, [2016-07-24T16:55:37.624177 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [SUBMITTED] [runtime=0.178991]
D, [2016-07-24T16:55:42.003892 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [CALLBACK/ERRBACK 200] [runtime=4.5587]
D, [2016-07-24T16:55:42.003984 #42569] DEBUG -- : [stop_when_all_finished] [states=[:finished, :body]]
D, [2016-07-24T16:55:42.006724 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [SUBMITTED] [runtime=4.561536]
D, [2016-07-24T16:55:45.318885 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [CALLBACK/ERRBACK 200] [runtime=7.873685]
D, [2016-07-24T16:55:45.319011 #42569] DEBUG -- : [stop_when_all_finished] [states=[:finished, :finished, :body]]
D, [2016-07-24T16:55:45.321175 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [SUBMITTED] [runtime=7.875988]
D, [2016-07-24T16:55:46.646499 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [CALLBACK/ERRBACK 200] [runtime=9.201303]
D, [2016-07-24T16:55:46.646617 #42569] DEBUG -- : [stop_when_all_finished] [states=[:finished, :finished, :finished, :body]]
D, [2016-07-24T16:55:46.651357 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [SUBMITTED] [runtime=9.206166]
D, [2016-07-24T16:55:50.993760 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/5MB.zip] [CALLBACK/ERRBACK 200] [runtime=13.548572]
D, [2016-07-24T16:55:50.993836 #42569] DEBUG -- : [stop_when_all_finished] [states=[:finished, :finished, :finished, :body, :finished]]
D, [2016-07-24T16:55:50.995672 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [SUBMITTED] [runtime=13.550481]
D, [2016-07-24T16:55:52.958938 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [CALLBACK/ERRBACK 200] [runtime=15.51375]
D, [2016-07-24T16:55:52.959023 #42569] DEBUG -- : [stop_when_all_finished] [states=[:finished, :finished, :finished, :finished, :finished, :body]]
D, [2016-07-24T16:55:59.708914 #42569] DEBUG -- : [new_http_request] [url=http://ipv4.download.thinkbroadband.com/10MB.zip] [CALLBACK/ERRBACK 200] [runtime=22.263717]
D, [2016-07-24T16:55:59.709023 #42569] DEBUG -- : [stop_when_all_finished] [states=[:finished, :finished, :finished, :finished, :finished, :finished]]
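
For reference, here's a minimal sketch of the shape this final example takes, minus the logging and timing that produce the output above (the URLs and the concurrency limit of 2 are just for illustration):

require 'eventmachine'
require 'em-http-request'

urls = %w[
  http://ipv4.download.thinkbroadband.com/5MB.zip
  http://ipv4.download.thinkbroadband.com/10MB.zip
]

EM.run do
  pending = urls.size
  # EM::Iterator keeps at most 2 requests in flight; iter.next releases
  # a slot so the next URL can be submitted.
  EM::Iterator.new(urls, 2).each do |url, iter|
    http = EM::HttpRequest.new(url).get
    done = proc do
      puts "#{url} => #{http.response_header.status}"
      pending -= 1
      iter.next
      EM.stop if pending.zero?  # stop only after all jobs have completed
    end
    http.callback(&done)
    http.errback(&done)
  end
end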

Conclusion

Thus ends my tutorial on using em-http-request and em-iterator to handle a number of long-running downloads with concurrency. A lot of the lines I have above are devoted to logging and comments. The code is actually quite concise, I think.

If this was helpful to you, share it along.


Clojure and Values (vs. Places)

I let my workmate craigy talk me into checking out Clojure. This is also partly because of all of the love Paul Graham heaps on Lisp in his writing.

I have been kicking tires using vim and lein repl for the past couple days but this morning I decided to take it to the next level.

I finally watched Rich Hickey's Keynote: The Value of Values. This doesn't help with Clojure directly but it really helps to get clear on what a value is and what it isn't and why you want your software to work with values. Really worth a watch and it will challenge the way you look at code and data.

After watching this video, I set about trying to get a decent dev environment set up so that I can experiment with the code without having to re-paste into a REPL. For me, the best way is to get a solid IDE setup and use unit tests for the tire kicking. RubyMine has been really good for me, so I decided to use IntelliJ CE with the Cursive plugin. There's still a good bit of configuring you need to do and it doesn't fully feel native, but it was better than my feeble attempts at using Eclipse.

Thank you to JetBrains and Ideogram for making non-commercial versions available.


Hacking on Chrome Extensions

I found myself waiting yesterday. Nothing seemed to be working out. I have a code review that isn’t really moving. And the lab resources weren’t in a usable state so I was rebuilding them.

And so it happened that I spent much of yesterday hacking on a Chrome extension. It's published to the Chrome Web Store and the source is on GitHub.

The Assignment

I gave myself a short assignment: add a button that gives me the HTML text of the work ID and description of a work item wrapped in an a-href, automatically copied to the clipboard.

This is a repeat-pattern that shows up in my work because I track my active tasks in Evernote so that I can get one-click access to the items on our internal Salesforce instance, which we use to track just about everything.
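
The end product is a fragment like this on the clipboard (the work ID and URL here are made up):

<a href="https://internal.example.com/work/W-1234567">W-1234567: Fix the frobnicator</a>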

Usually this involves 6 steps:

  • copy the work ID text
  • paste into Evernote using plain text paste (CMD-SHIFT-v)
  • copy the description text
  • paste into Evernote
  • copy URL
  • select the text in Evernote, and apply URL to the text (CMD-k, CMD-v)

It doesn't take that much time to do this process manually, but it's not a good use of time either, and it requires that you set up a couple of windows side by side.

A “Short” Assignment?

So I did what I usually do, which is to spend a chunk of time trying to figure out if I can make this easier, knowing there was the potential that my effort could fail, yielding nothing.

Onward! It was an hours-long exercise in google-fu and debugging.

I had to learn about how chrome sandboxes different parts of the extension to keep the browsing secure.

Chrome has this tendency toward asynchronous calls with callbacks and I’m not quite getting how to return values out to my main logic area. So I’m not proud of the code because I just got around it by nesting a lot of closures. But it works for now and I’ll pretty it up later.

Wasting Time?

You might consider this to be wasting time. It is certainly not productive in any way that I am measured in my job.

But, this is who I am.

What did I do over the years to acquire the skills that I possess in automation? I took the long and hard way. The way with some pricey-looking up-front costs.

You have to be willing to do things badly.

You have to be willing to do things slowly.

Paraphrasing Paul Graham: innovation tends to be heretical.

I was not very productive yesterday. And I didn’t even write elegant code.

And yet, I enjoyed the way I used my time yesterday. I’ve always been willing to do things the slow way up front so that I can do it methodically via code. It’s an investment and generally a more enjoyable experience than brute copy-paste efficiency.

Password Recovery for Cisco Nexus NXOS 7.0

Looks like getting into the kickstart for password recovery has changed a bit with NXOS 7.x. NXOS 7.x features a consolidated binary image rather than the previous pair of kickstart and system images we are used to seeing on Nexus 3000.

It's not exactly obvious how one boots to the kickstart. So here's the new sequence:

Step 1 - Power Cycle and Interrupt to the Loader with CTRL-L

When you see this:

Press  ctrl L to go to loader prompt in 2 secs

Press CTRL-L

Step 2 - Set recoverymode=1 and boot your NXOS 7.x image to get the Kickstart

First, ensure you can find your NXOS 7.x binary image using dir.

loader> dir
...
bootflash:
  nxos.7.0.3.I2.2a.bin

Then force boot to kickstart and boot the image.

loader> cmdline recoverymode=1
loader> boot bootflash:nxos.7.0.3.I2.2a.bin
Booting kickstart image: bootflash:nxos.7.0.3.I2.2a.bin
 Image valid
INIT: version 2.88 booting
Skipping ata_piix for n3k.
...

This should land you at the switch(boot)# prompt and you can follow usual procedures from there.

Travis Deployment to Rubygems.org

After working on it rather un-seriously for a number of months, I decided to finally learn how to publish a Ruby gem today. You can thank my workmate, Craig, for this.

And since I am lazy, I also made sure to have Travis CI do it automatically for any releases I tag on Github. Here is the result:

We have a success, folks! So I thought I'd document the process to help others trying to do something similar.

How to Get Your Gems Flowing to RubyGems.org from Travis CI

Step 1 - Setup an account on Rubygems.org

You will need an account, and, more importantly, the API key that comes with it. Setting up that key is the next step.

Step 2 - Setup your Rubygems.org api key

This tip comes from the Make Your Own Gem guide at rubygems.org.

To setup your RubyGems API key, do the following (and be sure to substitute your rubygems username where you see the ${USERNAME} field).

curl -u ${USERNAME} https://rubygems.org/api/v1/api_key.yaml > ~/.gem/credentials; chmod 0600 ~/.gem/credentials

Step 3 - Use the Jeweler Gem

We use the jeweler gem to handle versioning and the creation/update of the .gemspec manifest so that we don't have to do these things by hand. The jeweler gem adds tasks to rake, and I found it really easy to get it set up and customized for my gem.

The README on GitHub pretty much had everything I needed.

The section labeled "Customizing Your Gem" is especially important since it offers a modifiable section of code you can drop in your rake files:

require 'jeweler'
Jeweler::Tasks.new do |gem|
  # gem is a Gem::Specification... see http://guides.rubygems.org/specification-reference/ for more options
  gem.name = "whatwhatwhat"
  gem.summary = %Q{TODO: one-line summary of your gem}
  gem.description = %Q{TODO: longer description of your gem}
  gem.email = "josh@technicalpickles.com"
  gem.homepage = "http://github.com/technicalpickles/whatwhatwhat"
  gem.authors = ["Joshua Nichols"]
end
Jeweler::RubygemsDotOrgTasks.new

Here is the jeweler customization for expect-behaviors.

Once jeweler is setup, we now have a rake driven flow to build the gemspec file so that you can bump a version by doing:

  • rake version:bump:patch
  • rake gemspec

You will have to decide for yourself how you feel about rake release. I plan to use Travis CI to deploy so the release part is less important to me.

Also important: it was really useful for me to build the gem locally, because it turns out the gem builder really didn't like the way I wrote some of my dependencies. To do this, run gem build ${GEMSPEC_FILE}.

Step 4 - Get Travis CI up and running

I'm assuming that if you're reading this, you're already familiar with the basics of Travis CI. But in case you need help to get going, I would refer you to their docs:
https://docs.travis-ci.com/user/getting-started/

You want to be up and running with tests before you move on to the next step. If it helps to see an example of how little it takes to configure Travis, you can see my .travis.yml file here. The important sections are covered on lines 1-9 for basic testing. Creating the deploy section is covered in the next step.

Step 5 - Install the Travis Gem and setup rubygems

  • gem install travis
  • travis setup rubygems

Dead simple. That's why we love Ruby.

You will be prompted for everything else. Here is what I saw:

$ travis setup rubygems 
Gem name: |expect-behaviors| 
Release only tagged commits? |yes| 
Release only from francisluong/expect-behaviors? |yes| 
Encrypt API key? |yes|
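
That writes a deploy section into your .travis.yml. The result should look roughly like this (the secure value is your encrypted RubyGems API key):

deploy:
  provider: rubygems
  api_key:
    secure: "..."
  gem: expect-behaviors
  on:
    tags: true
    repo: francisluong/expect-behaviors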

Step 6 - Travis will attempt to deploy when you draft a new release on Github.

But make sure to bump the revision in the commit too.

Here's what I think my flow will be:

  • As part of my branch/pull-request, I will include a rake version:bump:patch (or :major or :minor as appropriate)
$ rake version:bump:patch  
Current version: 0.1.2  
Updated version: 0.1.3

Cisco Nexus 3000 Loader and Kickstart - Getting Unstuck

I found myself in a situation at work where after a downgrade, my Cisco Nexus 3000 switches were sitting at a loader> prompt with no readable images on the bootflash. Here are some things I noticed while trying to get unstuck.

Network Connectivity for Loader and Kickstart

If you end up in a jam where your switch cannot boot to a system image, you will need one of the following to get full network access:

  • a bootable kickstart/system image on the bootflash (assuming it is readable... mine was not)
  • a bootable kickstart and system image on a USB drive connected to the USB port of the switch
  • a TFTP server that can be reached via the Management Ethernet port (MGMT) on the switch, once you apply an IP address and, if needed, a default gateway (no dynamic routing)

For this last option, I stress that the MGMT port is the only port you can configure with an IP address in loader or kickstart.

We don't normally have the MGMT port cabled up, so I needed to reach out to the Data Center engineers to get them to move a cable from a normal switch port to the MGMT port. Once I did that, I was able to boot from the loader to kickstart.

Examples of interaction for loader and kickstart

TBD... I will update this when I get my log files off my work computer for these interactions.

The Case For An Out of Band Management Ethernet

All of this jockeying to try to get network access when the switch is down hard... This makes it very clear that the management interface is special.

There is generally value to being able to reach the switch independent of the routing protocols running on the switch. And if hands-off recovery of a switch that is not able to boot is a requirement, having a management ethernet network built out is a prerequisite.

Alternatively, if you have a solid way of ensuring that a spare switch can be swapped in for a failed one, and that switch either has or will be configured with a valid configuration, then the case for building out a management ethernet network is weaker.

Either of these will address the problem for this scenario. You probably don't need both. It's important to remember that having an out of band management ethernet network is a solution to one or more problems. And if you have alternate solutions, it may make sense not to have one.

That being said, a management ethernet network is also useful for stats, logging, and controlling the switch in a manner independent of the switch's routing state and the access-control lists applied to the normal switch ports. So there are other scenarios for which swapping in a spare doesn't get you the same level of functionality.

OSX: Changing the Screenshot save folder

OSX tip: Create a folder on your desktop called “Screenshots”, then use this method to change the default screen shot folder to the one you just added.

defaults write com.apple.screencapture location $HOME/Desktop/Screenshots; killall SystemUIServer
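
And if you ever want to go back to the default location, delete the override the same way:

defaults delete com.apple.screencapture location; killall SystemUIServer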

Declarative Network Automation: Config Scope Tags

I'm going to spend a bit more time writing about network automation. If networking is not part of what interests you, I'll try to make certain that you can tell from the title of my posts that it will not be of interest.

Problem Space

When defining configuration, I hate to repeat myself. Copy and paste is useful but has its risks: I get snagged when I'm doing a paste-and-modify but forget the modify step. Also, if you ever have to change the configuration in the future, any repeated data has to be fixed in many places.

In an ideal setup, we want to be able to declare configuration for each device in an autonomous system with as little repetition as possible.

This implies hierarchy, but an important observation is that very few network architectures fit the neat and tidy mold of a tree that can be concisely defined. More specifically, we can't assume that the parents in a tree tell us anything about what the children will be like. For example, a fabric switch may have leaves of type TOR-A and leaves of type TOR-B.

[Diagram: arbitrary-leaves - a fabric switch with leaves of different types]

Proposed solution

It would be nice if we could declare a set of scopes that a device belongs to and those are used to determine the bulk of its common configuration with other similar devices.

Also, so that we may reason about how the configurations will be built deterministically, the scopes themselves must be defined with a property that indicates the order of application (or broadness/narrowness of scope).

Here I propose a new concept: "Config Scope Tags". Every device config declaration must begin by identifying the list of Config Scopes the device belongs to.

Arbitrary Assignment

Scope tags are defined and then arbitrarily assigned to each device. They can be used in a hierarchical fashion but are not restricted to any particular tree shape or method of inclusion. In this respect, they are similar to community tags in BGP.

An example set of Config Scope Tags applicable to a data center environment may be:

  • All Sites
  • Site Name
  • Build/Architecture Revision
  • Device Class - e.g. TOR/Fabric
  • Device Model#/SKU/FRU
  • Fabric Bank
  • Pod ID

Property: Order

In order to achieve minimal repetition, each scope tag has an order property that establishes the order in which template variables are applied. 0 is the broadest scope, and its values are applied first so that they may be overridden by more specific data. Scopes with higher order values are then merged into the configuration hash in ascending order.

0 should be reserved as a special value for All Sites, applicable to all devices in all domains. This is the only global scope.

The narrowest scope is the device-specific configuration.
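
In other words, building a device's configuration hash is just an ordered merge. A Ruby sketch, with an illustrative scope shape:

scopes = [
  { tag: "site-iad1", order: 10, vars: { ntp_server: "10.0.0.1" } },
  { tag: "all-sites", order: 0,  vars: { ntp_server: "192.0.2.1", domain: "example.net" } },
  { tag: "pod-12",    order: 60, vars: { ntp_server: "10.12.0.1" } }
]

# apply the broadest (lowest order) scope first so narrower scopes override
config = scopes.sort_by { |s| s[:order] }
               .reduce({}) { |acc, s| acc.merge(s[:vars]) }
#=> { ntp_server: "10.12.0.1", domain: "example.net" }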

Example

I was implicitly conceptualizing this as hierarchical YAML, so I express my example that way, but it could be stored in any form.

scopes.yml

This is a definition of config scope tags and their order values.
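
An illustrative sketch (the tag names and order values are my own placeholders):

scopes:
  all-sites:     { order: 0 }    # global; applied first, overridden by everything below
  site-iad1:     { order: 10 }
  arch-rev-b:    { order: 20 }
  class-tor:     { order: 30 }
  model-n3064:   { order: 40 }
  fabric-bank-1: { order: 50 }
  pod-12:        { order: 60 }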

tor-5a.yml

This is a sample configuration for a TOR switch.
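
Again an illustrative sketch, with the scope-tag memberships declared up front and device-specific data (the narrowest scope) last:

device: tor-5a
scope-tags:
  - all-sites
  - site-iad1
  - arch-rev-b
  - class-tor
  - model-n3064
  - fabric-bank-1
  - pod-12
config:
  hostname: tor-5a    # device-specific values override all scope tags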