June 25, 2008

Phil Windley
pjw
Phil Windley's Technometria
» Velocity 08: Puppet In-Depth and Hands-On

The final talk of the day (hope I make my flight) was by Luke Kanies of Reductive Labs on Puppet.

Most automation tools are based on SSH and, as a result, they suck. The problem is that the intersection of administrators and developers is very small. Luke wanted Puppet to be so good it was like bringing a gun to a knife fight. The goal: manage lots of machines with very little effort.

Luke makes an analogy between the transition from assembly to C and the move from commands and files to "resources." Resources are abstract and portable. Abstraction is the most important idea here. Why should we have to know how to, for example, update packages on both Fedora and Debian?

Packages are the basic unit. You can install, uninstall, update, etc. packages. There are 23 package types. Resources say what to do for a given package. Here's an example for SSH:

class ssh {
  package { ssh: ensure => installed }
  file { sshd_config:
    name => $operatingsystem ? {
      Darwin  => "/etc/sshd_config",
      Solaris => "/opt/csw/etc/ssh/sshd_config",
      default => "/etc/ssh/sshd_config"
    },
    source => "puppet://server.domain.com/files/ssh/sshd_config"
  }
  service { ssh:
    name => $operatingsystem ? {
      Solaris => openssh,
      default => ssh
    },
    ensure    => running,
    subscribe => [Package[ssh], File[sshd_config]]
  }
}

The class provides intent; by creating a class for ssh, you're saying it should be installed and running. Note that the installation, configuration, and service all have their own definitions.

Nodes allow you to associate hosts, by type, with resources. Transactions make sure everything is in its correct state. Transactions are idempotent--they don't have an effect if machines are already in the right state. Idempotency allows management throughout the lifecycle. Puppet should manage a machine from its birth to its death.
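
To make the idempotency point concrete, here is a minimal sketch in Python (not Puppet, and not how Puppet is implemented). The helper `ensure_line` and the `PermitRootLogin no` setting are hypothetical, chosen only for illustration. The idea is to describe the desired state, check it first, and change something only when the check fails.

```python
import os
import tempfile


def ensure_line(path: str, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed, False if it was
    already in the desired state (in which case nothing is written).
    This is a hypothetical helper, not part of Puppet.
    """
    text = ""
    if os.path.exists(path):
        with open(path) as f:
            text = f.read()
    if line in text.splitlines():
        return False  # already correct: applying again has no effect
    with open(path, "a") as f:
        if text and not text.endswith("\n"):
            f.write("\n")  # keep the new entry on its own line
        f.write(line + "\n")
    return True


def demo() -> list:
    """Apply the same desired state three times to a scratch file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sshd_config")
        return [ensure_line(path, "PermitRootLogin no") for _ in range(3)]
```

Only the first call in `demo()` changes anything; the later calls find the file already correct and leave it alone, which is what makes it safe to run the same description against a machine over and over.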

Tags: velocity08 puppet automation infrastructure

June 24, 2008

» Velocity 08: Even Faster Web Sites

Steve Souders of Google is speaking on Even Faster Web Sites. I've read Steve's book and loved it. It's the kind of book you read in the morning and use to make changes to your site in the afternoon; by the end of the day, you've made a huge difference.

Usually, a small percentage of the time (10%) a browser spends putting a page in front of the user is spent downloading the HTML document. Making the Web server faster might save compute time or storage, but it doesn't do much for the user's perceived response time.

80-90% of the end user response time is spent on the front end of the page load experience, so start there. You'll have a greater potential for impact. The changes are simpler than backend tuning. And finally, they've been proven to work. I can personally vouch for that.

Steve created rules for high performance Web sites and built them into a Firefox extension called YSlow. Running it on www.windley.com shows that I don't do so well. I get a grade of "F." I'll have to work on that!

Steve's writing a new book that focuses on some new rules. JavaScript is the place to focus. JavaScript requests can have a big impact. They block parallel downloads because once they start executing nothing else happens. Steve created Cuzillion to let people easily create test pages that show browser behavior for particular choices.

Steve shows a table showing that Facebook, for example, downloads 1MB of JavaScript and executes only 9% of the functions. Of the top ten sites on Alexa, the average is 252KB with 26% of the functions executed.

The first new rule is to split the initial payload. Split JavaScript between what's needed to render the page and everything else. This is largely something you have to do manually. Finding all the possible code paths is hard.

The next rule: avoid script blocking. There are numerous techniques to avoid script blocking. Most of these require some refactoring of the JavaScript code because you're downloading and then using a technique to eval what got downloaded. One technique that doesn't is appending the script after the page has loaded. One big difference between these techniques is how they affect the busy indicators. Some people get nervous when the page is "still loading" even when it's rendered and just downloading scripts. On the other hand, if the status bar says "done" and the page hasn't rendered completely, people will reload unnecessarily.

Long inline scripts block rendering and download. You can initiate execution with a setTimeout, move the code to an external script, or use the defer attribute (IE only). Don't scatter inline scripts through the page and don't put them between the stylesheet and any other resource.

Tags: velocity08 web+performance

» Velocity 08: Storage at Scale

Google's reliability strategy is to buy cheap hardware with no reliability features and create reliable clusters from them because no problem Google wants to solve fits on a single machine anyway.

The Google File System (GFS) is a cluster file system with a familiar interface, but it's not POSIX compliant. Bigtable is a distributed database system with a custom interface, not SQL. There are hundreds of instances of each of these cells, scaling to thousands of servers and petabytes of data.

In GFS, a master manages metadata. Data is broken into chunks (64MB) and multiple copies (typically three) of each chunk are stored on various machines. The master also handles machine failures. Failures are frequent when you use lots of commodity hardware. Checksumming detects errors; replication allows for recovery. This all happens automatically. Higher replication is used for hotspots.
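
A rough way to picture the chunk, replica, and checksum bookkeeping described above is the Python sketch below. It is not GFS code. The function names, the round-robin placement, and the use of SHA-256 as the checksum are all assumptions made for illustration; only the 64MB chunk size and the three replicas come from the talk.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned in the talk
REPLICAS = 3                   # "typically three" copies of each chunk


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Cut a byte string into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks: list, servers: list, replicas: int = REPLICAS) -> list:
    """Build master-style metadata: a checksum and replica holders per chunk.

    Placement is simple round-robin, which is an assumption; it only
    yields distinct holders when replicas <= len(servers).
    """
    metadata = []
    for index, chunk in enumerate(chunks):
        holders = [servers[(index + r) % len(servers)] for r in range(replicas)]
        metadata.append({
            "index": index,
            "checksum": hashlib.sha256(chunk).hexdigest(),  # stand-in checksum
            "servers": holders,
        })
    return metadata


def is_corrupt(chunk: bytes, entry: dict) -> bool:
    """Detect a damaged copy by comparing it against the stored checksum."""
    return hashlib.sha256(chunk).hexdigest() != entry["checksum"]
```

When `is_corrupt` flags a copy, recovery amounts to reading the chunk from one of the other servers listed in its metadata entry and writing a fresh replica.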

Most data is in two formats:

  • RecordIO - sequence of variable sized records
  • SSTable - sorted sequence of key/value pairs

Bigtable is built on top of GFS. It holds lots of semi-structured data ordered by URL, user-id, or geographic location. The size of the data sets varies widely (e.g. page data vs. satellite-image data).

Tables are broken into tablets. These are treated as chunks for replicating in GFS. When a tablet gets too big, it's broken in two. Tablets go into SSTables in GFS.
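
The "split when it gets too big" behavior can be sketched in a few lines of Python. This is an illustration only, not Bigtable's implementation: the `insert_row` function is hypothetical, and it splits on a row count (`MAX_ROWS`) where a real system would split on size.

```python
import bisect

MAX_ROWS = 4  # hypothetical threshold; a real tablet splits on size, not rows


def insert_row(tablets: list, key: str, value: str, max_rows: int = MAX_ROWS) -> None:
    """Insert a row into a table stored as key-ordered tablets.

    `tablets` is a list of tablets in key order; each tablet is a
    sorted list of (key, value) pairs. A tablet that grows past
    `max_rows` is replaced by its two halves.
    """
    if not tablets:
        tablets.append([(key, value)])
        return
    # Pick the last tablet whose first key is <= key (or the first tablet).
    first_keys = [tablet[0][0] for tablet in tablets]
    i = max(bisect.bisect_right(first_keys, key) - 1, 0)
    tablet = tablets[i]
    keys = [k for k, _ in tablet]
    pos = bisect.bisect_left(keys, key)
    if pos < len(tablet) and tablet[pos][0] == key:
        tablet[pos] = (key, value)  # existing row: overwrite in place
    else:
        tablet.insert(pos, (key, value))
    if len(tablet) > max_rows:
        mid = len(tablet) // 2
        tablets[i:i + 1] = [tablet[:mid], tablet[mid:]]  # break it in two
```

Because every tablet stays sorted and the tablets themselves stay in key order, finding the tablet responsible for a key remains a binary search no matter how many splits have happened.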

Scale is important. Google is envisioning how to create exabyte systems. The systems need to be more and more automated, because the number of systems is growing faster than Google can hire.

Tags: storage google scaling velocity08

» Velocity 08: Some Tools for Improving Web Performance

HTTPWatch is an HTTP traffic viewer for Internet Explorer. There's a free basic edition, but the professional edition is almost $300! Whew! Firebug for Firefox, of course, remains free.

Fiddler is a Web debugging proxy that runs on a local port on your PC. It registers itself as a system proxy so it should work with most browsers (Firefox needs special configuration apparently). You can also point a browser on another machine at the proxy running on your PC. Wonder if you could do this in Fusion? It would probably work fine. You can also use it to monitor outgoing traffic for funny programs phoning home. Breakpoint debugging seems like a cool idea.

AOL Pagetest is an IE plug-in for measuring and analyzing Web page performance. Pagetest is free. One cool feature: it provides a checklist to see if your page is taking advantage of several well-known "best practices" for making Web pages speedy.

Firebug is a Firefox extension that is used for measuring and debugging Web sites. The Profile button profiles Javascript on the page and gives time by function. Statistical analysis can help with functions that have different behaviors depending on input. John Braton gave a demo of how to use Firebug profiling to optimize a page.

Tags: velocity08 web+performance browsers

» EUCALYPTUS - Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems

Rich Wolski from University of California, Santa Barbara is speaking about an open source implementation of cloud computing that has an interface compatible with Amazon's EC2 called Eucalyptus.

Rich does research on grid computing. He's been looking for the "open source" cloud. He mentions Nimbus (Univ. of Chicago) and Enomalism. But nothing came close to what they wanted: Linux image hosting à la Amazon.

By choosing to make their interface compatible with EC2, they take advantage of all the client side tools that work with EC2 to manage machines in Eucalyptus. They want one-button install of their system on top of a cluster of machines.

The goals:

  • Foster research in cloud computing
  • Create a vehicle for experimentation prior to buying commercial services
  • Provide a debugging and development platform for EC2
  • Provide a development platform for the open source community
  • Not designed as a replacement technology for EC2 or other cloud services

Challenges:

  • Extensibility - simple architecture and open internal APIs
  • Client side interface - Based on the EC2 WSDL and 2008 compliant except for static IP address assignment and security groups. There's no public information on system administration of the cloud, so Eucalyptus provides its own interface for that.
  • Networking - VPN per cloud. Public IPs are scarce, so all cloud images have access to a private network interface, but not public interfaces.
  • Security - authentication and authorization. All Eucalyptus components use WS-Security for authentication. Intercomponent messages are not encrypted by default. SSH key generation and installation à la EC2 is implemented.
  • Packaging, installation, maintenance - uses Rocks. They want to change this.

Lessons learned:

  • Open source for cloud computing constrains design more than they thought it would. Local configuration choices provide a real challenge.
  • No one in the real world still builds clusters by hand.
  • There are few cloud computing configuration tools available.

Tags: velocity08 automated+infrastructure amazon web+service cloud+computing

» Velocity 08: High Performance AJAX Applications

Julien Lecomte from Yahoo! is speaking about creating performant AJAX applications. The most important point: plan for performance from day 1. Interestingly, many of his initial points are about telling the developer to work with the product manager and not just say "no."

Julien references Web Site Optimization: 13 Simple Steps by Stoyan Stefanov. Here are some tips:

Less is more. Don't do unnecessary things.

Break rules. Make compromises and break best practices when needed. For example, you might decide to forgo CSS. Especially CSS expressions.

Work on improving perceived performance. Cheat by making users think things are done before they are.

Measure performance. Test using a setup similar to the user's configuration. Profile your code during development. Automate profiling and performance testing. Keep historical records of how features perform.

Minify CSS and JavaScript files. Use something like the YUI Compressor. Stay away from compression schemes that require run time decompression. You can also combine the CSS and JavaScript files. Optimize images.

Loading and parsing HTML, CSS, and JavaScript code is costly. Be concise and write less code. Make good use of libraries. Splitting JS libraries into bundles for specific uses might save time.

Close HTML tags. Unclosed tags take longer to parse. Load assets (even images) on demand.

Most DOM operations can be accomplished before the onload event has fired. You can also load the scripts after the page has fully loaded.

In JavaScript a lookup is done each time a variable is accessed. Declaring variables with the var keyword and making them local helps. Avoid global variables at all costs. Avoid with. You can use a local variable to "cache" the value of a variable outside the current scope when it's going to be accessed repeatedly.

Limit the number of event handlers. Attaching an event handler to hundreds of elements is very costly and can be a source of memory leaks.

Reflows happen when the DOM tree is manipulated. You can minimize reflows by taking advantage of browser built-in optimizations. For example, modifying an invisible element doesn't trigger reflow.

Use onmousedown instead of onclick to take advantage of the time between the start of the button press and the release.

Avoid using JavaScript for layout. Use CSS where possible.

Never resort to a synchronous XHR. Asynchronous programming is more complicated but it's worth it. Deal with network timeouts programmatically.

If you validate user data on the browser, 99.9% of the time, the request will succeed, so lock the affected elements, let the user know something's happening, and process the request while the user continues to use the application.

Use JSON rather than XML. Consider local storage and just process diffs. Multiplex AJAX requests where you can.

Tags: velocity08 ajax performance browsers

June 23, 2008

» Velocity 08: Actionable Logs

Mandi Walls from AOL is talking about creating actionable logs. Actionable logs are logs that provide data that can be used to fix problems. There are a few rules to start with:

  • No nonsense logging
  • Concise, easy to understand
  • Express symptoms of production issues
  • Anything that makes the log needs to be something that can be fixed (better signal to noise ratio)

Every time you write to a log file, you're expending resources. The point of logging in production is diagnosing issues. You need to be able to understand the logs at 4am.

The primary goal is diagnosis and recovery of problems. Secondary goals include statistics and monitoring, insight into application behavior, and indicating potential problems. Note that these are different than the goals of development and QA logs.

Logs come in different flavors: access logs, server logs (e.g. Catalina), application logs, and special use logs for groups of activities.

Some hints:

  • Log locations should be predictable and obvious. You may want logs on different disk partitions (this stops full file systems from crashing the server). Keep old log files in an obvious place as well.
  • Roll logs into files with timestamps in the names.
  • Logs should be human readable and easy to parse. Use real dates and times. Unix timestamps don't pass the 4am test. Good timestamps give you the ability to link server activities to external events (like network outages).
  • Create a common format for multiple applications where possible.
  • Use one line per log message wherever possible.
  • Avoid messages that contain only numerical error codes.
  • Put URLs to external info in log messages where appropriate.
  • Be consistent about severity. Saying something's "fatal" without more data isn't helpful.
  • Log at the first point the error is encountered. If a server is processing 100,000 requests per minute, waiting a minute to log something means there's lots of data in between the problem and the log entry.
  • Actively manage and prune logs to make new errors obvious.
  • Don't include usernames, logins, passwords, etc. These are development logging issues, not production.
  • An application log should have 10-25% the number of entries of the access log. Too much data hides problems.
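
Several of these hints (real dates and times, one line per message, a common format, a named severity, a URL to external info) can be shown with Python's standard `logging` module. The sketch below is one possible arrangement, not something from the talk. The format string, the helper names, and the `wiki.example.com` URL are assumptions for illustration.

```python
import logging
import logging.handlers

# A common, human-readable format that several applications could share.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # real dates and times, not Unix timestamps


def make_formatter() -> logging.Formatter:
    return logging.Formatter(fmt=LOG_FORMAT, datefmt=DATE_FORMAT)


def format_event(logger_name: str, level: int, message: str, help_url: str = None) -> str:
    """Render one event as a single readable line.

    `help_url` is an optional pointer to external info (hypothetical here).
    """
    if help_url:
        message = f"{message} (see {help_url})"
    message = " ".join(message.split())  # collapse newlines: one line per message
    record = logging.LogRecord(
        name=logger_name, level=level, pathname="", lineno=0,
        msg=message, args=None, exc_info=None,
    )
    return make_formatter().format(record)


def make_rotating_handler(path: str) -> logging.Handler:
    """Roll the log at midnight; rotated files get a date suffix in the name."""
    handler = logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", backupCount=14, delay=True
    )
    handler.setFormatter(make_formatter())
    return handler
```

For example, `format_event("checkout", logging.ERROR, "payment gateway timeout", help_url="https://wiki.example.com/E1042")` yields a single line that starts with a readable timestamp and names the severity and the application, which is the kind of entry that passes the 4am test.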

In summary, make production logs about helping operations staff solve problems. Good logs can help solve problems. Poor logs can hinder problem solution.

Tags: operations logging velocity08

» Velocity 08: Energy Efficient Operations

Luiz Barroso from Google is speaking about Energy Efficient Operations. Computing has a great track record of having a positive impact on society. The world needs more computing. But more computing means more energy (usually).

World energy use of servers is around 1% of total electricity consumption. Making efficient computers is harder than making efficient refrigerators. Efficiency is computing speed divided by power usage. But that's too simple. For a server, you have to take into account compute efficiency, server efficiency, and data center efficiency. These get multiplied together. Ugh.
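
To see why multiplying hurts, here is a tiny Python sketch. The function name and the sample figures are made up for illustration and are not numbers from the talk.

```python
def overall_efficiency(compute: float, server: float, data_center: float) -> float:
    """Combine the three efficiencies (each a fraction from 0 to 1) by multiplying."""
    return compute * server * data_center


# Hypothetical figures, chosen only to show the effect of multiplication.
example = overall_efficiency(compute=0.5, server=0.75, data_center=0.5)
```

With these made-up inputs, `example` comes out to 0.1875: three layers that each look tolerable on their own deliver under a fifth of the energy as useful work once they are stacked.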

Data centers are underutilized, which accounts for a wasted power provisioning investment and less efficient power and air distribution. Typical server power supplies dissipate 25% of total energy as heat. Computers are the least efficient at their most common operating points.

The operating cost of a data center is about $9/watt over 10 years. But the cost of building the data center is $10-22/watt. Facility costs are more important than operating costs in energy terms. Maximizing usage is a great way to save energy.

Here are some things to do:

  1. Consolidate workloads into the minimum number of machines needed for peak usage requirements
  2. Measure actual power usage of devices. Nameplates lie and overstate usage.
  3. Study activity trends and investigate oversubscription potential. You don't want to go over (bad for machines and bad from a contractual standpoint).

This lets you pack the most servers in your data center that you can. A study at Google showed that you have to be able to spread computing over a larger number of machines in order to really take advantage of oversubscription. At the facility level, you might be able to host 20% more servers through oversubscription.

If you have a search cluster, a map-reduce cluster, and a web-mail cluster, the oversubscription potential is fairly low. But combined, they have substantially more because mixed workloads balance out demand better. Monitor and "victimize" a defined "best-effort" workload when problems arise.
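
The reason mixed workloads help is that their peaks rarely line up, so the peak of the combined load is lower than the sum of the individual peaks. Here is a minimal Python sketch; the function name and the power traces are invented for illustration and are not data from the Google study.

```python
def oversubscription_headroom(traces: dict) -> tuple:
    """Compare provisioning for each cluster's own peak against the combined peak.

    `traces` maps a cluster name to power samples taken at the same
    instants. Returns (sum_of_peaks, peak_of_sum, headroom_fraction).
    """
    sum_of_peaks = sum(max(samples) for samples in traces.values())
    combined = [sum(at_instant) for at_instant in zip(*traces.values())]
    peak_of_sum = max(combined)
    return sum_of_peaks, peak_of_sum, 1 - peak_of_sum / sum_of_peaks


# Invented power traces (arbitrary units), one sample per time slot.
traces = {
    "search":     [60, 100, 80, 40],
    "map_reduce": [100, 40, 60, 80],
    "web_mail":   [40, 50, 90, 60],
}
sum_of_peaks, peak_of_sum, headroom = oversubscription_headroom(traces)
```

In this made-up case the clusters would need 290 units if each were provisioned for its own peak, but the combined load never exceeds 230, and that gap is the room oversubscription tries to use.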

Next: switching to energy-proportional computing. Consider the data center as a single computer. Call it a "land-held computer." :-) Most of the time servers aren't idle or at peak (unlike laptops). This is a result of the fact that high performance and high availability require load balancing and wide data distribution. We design them to work this way. The result is there are no useful idle intervals in which to shut a processor down. There are lots of low activity intervals.

An idle server uses about 50% of the peak power requirement. But if you plot efficiency, the server becomes much less efficient below about 30% usage. 100% isn't realistic, but getting over 30% is.

So, energy-proportional computing is the idea of making the efficiency more linear. This would greatly reduce the need for complicated power management. CPUs are actually better at energy proportionality than other components (like RAM, disk, network, fans, etc.). An idle CPU, for example, consumes less than 30% of its peak power whereas DRAM is about 50%, disks are over 75%, and networking is over 85%!
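
A back-of-the-envelope model shows why a high idle draw ruins efficiency at low utilization. The Python sketch below assumes power rises linearly from the idle level to peak, which is a simplification of my own and not a claim from the talk; the function names are hypothetical.

```python
def power_fraction(utilization: float, idle_fraction: float = 0.5) -> float:
    """Power drawn as a fraction of peak, assuming a straight line from idle to peak."""
    return idle_fraction + (1 - idle_fraction) * utilization


def relative_efficiency(utilization: float, idle_fraction: float = 0.5) -> float:
    """Work done per unit of power, normalized so full load scores 1.0."""
    return utilization / power_fraction(utilization, idle_fraction)
```

Under this model a server that idles at 50% of peak scores about 0.46 at 30% utilization and about 0.18 at 10%. Setting `idle_fraction=0.0` describes a perfectly energy-proportional machine, which scores 1.0 at any nonzero load.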

Moreover, CPUs have active low power modes. A CPU at a slower clock rate still executes instructions, but DRAM and disks in low power mode need to bump up to full power to operate.

If there's any question whether this is a good idea, consider that the human body has a factor of 20 between its resting power consumption and its peak (at least for elite athletes).

The most basic thing you can do is to write fast code. This is the software engineer's biggest contribution to energy efficiency.

Throughout the talk Luiz referenced a paper from ISCA07. I believe this is it: Power provisioning for a warehouse-sized computer by Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Andre Barroso.

Tags: it+operations energy velocity08

» Velocity 08: Jiffy: Instrumenting and Measuring Web Performance

Scott Ruthfield from WhitePages.com is announcing a new open-source project called Jiffy, a tool for measuring the end-to-end performance of Web sites (PDF slides). Jiffy provides real data about performance that is more complete and more fine grained than what you might get from Keynote or Gomez. Jiffy has four goals:

  • Real data at scale - track 100% of page views
  • Measure anything - pre-load data access, each ad, brand, when the form is ready, and so on
  • Real-time reporting
  • No impact on page performance

Jiffy comprises a JavaScript library that instruments the pages, an Apache proxy, a tool for putting log data into a database (Oracle for now), and reporting roll-up code and UI.

The basic idea is "mark and measure." You can set a mark and then make any number of measurements of how much time has elapsed since the mark. You can do immediate or batch submits depending on the requirements of your site and how much bandwidth you want to consume.
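
The mark-and-measure pattern is easy to sketch. The Python below is an analog written for illustration, not Jiffy's JavaScript API; the class `Timeline`, its method names, and the `submit` callback are all hypothetical.

```python
import time


class Timeline:
    """Set named marks, then measure elapsed time from any mark.

    Measurements are queued and handed to `submit` in batches; a
    batch_size of 1 gives immediate submits.
    """

    def __init__(self, clock=time.monotonic, batch_size=3, submit=None):
        self._clock = clock
        self._marks = {}
        self._pending = []
        self._batch_size = batch_size
        self._submit = submit or (lambda batch: None)

    def mark(self, name: str) -> None:
        """Record the current time under `name`."""
        self._marks[name] = self._clock()

    def measure(self, label: str, mark_name: str) -> float:
        """Record how much time has elapsed since `mark_name`."""
        elapsed = self._clock() - self._marks[mark_name]
        self._pending.append((mark_name, label, elapsed))
        if len(self._pending) >= self._batch_size:
            self.flush()
        return elapsed

    def flush(self) -> None:
        """Send any queued measurements as one batch."""
        if self._pending:
            self._submit(list(self._pending))
            self._pending.clear()
```

A page would call something like `mark("page_start")` once and then `measure("form_ready", "page_start")`, `measure("ads_loaded", "page_start")`, and so on. The batch size is the knob that trades reporting latency against bandwidth.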

Bill Scott of Netflix has written an extension to Firebug for Jiffy.

Tags: performance web velocity08

» Velocity Keynote: IT Operations are Unsustainable

Bill Coleman (the "B" in BEA) is giving the opening keynote titled Green Data Centers, but it's really about sustainable operations. He begins by saying that the current way we operate data centers is unsustainable. Operations costs are growing at twice the rate of IT in general. This is the unintended consequence of the success of networked operations. Scale and complexity have grown dramatically. Bill claims 5 orders of magnitude.

In some organizations it can take 6 months to get a new server into the data center. If you do it faster than that, good for you. Virtualization is a band-aid. It adds to complexity while helping with scale. IT automation helps deal with complexity.

We're nearing a huge inflection point: it's theoretically possible to connect anything to anything. Soon that will be the norm.

What's the answer? The cloud. Today we have Cloud 1.0. I use proprietary tools to build a web application on a proprietary platform. Soon those tools will be more sophisticated, but they will still be proprietary. Cloud 3.0 will make all that a commodity. We will take it for granted.

Power savings come from actively managing servers--turning them on and off as needed. Real-time, dynamic capacity management is the key, but the real point is that policy-based server management provides for scalability and pooling of resources.

Bill talks about a major corporation with 15,000 virtual machines and they still only get 20% utilization. You can get to 80% on a mainframe because they mix and match jobs--mission critical or not--on the same hardware. Policy-based data center management allows that to happen by tacking up and tearing down VLANs, servers, and storage to meet current demands.

Tags: velocity08 operations data+center it+automation