Latest News >> 2008-04-29

UPDATE: There seems to be some interest in Idiopidae too and I’ve been neglecting the Idiopidae project page while working on Vellum. I’ve updated that page with information and updated install procedures, so please check it out again if nothing worked for you.

2008-04-26

UPDATE: I forgot to mention that this is my first LaTeX text ever and that it’s based on one of my first Python projects and that I wrote what you find below in about 3 days while dorking around on Vellum and Idiopidae. Since I’m both a Python and LaTeX newb please feel free to school me. Better yet, if you think the typesetting sucks, then show me your samples. See if you can beat this one sent to me by Kashif Rasul. I still consider what he’s created as the bar to get over.

2008-04-18

While working on a more complex build I decided to make recursive imports work and clean up the syntax for imports in Vellum

2008-04-15

I worked on Vellum today after waking up from jetlag and Poland. I feel like it’s near ready for actual use by people. I even managed to polish it off with a nice little command line option for dumping the commands a build spec uses including their documentation. Check out this Pastie clip that shows it off.

Ruby/Odeum vs. Lucene Part 2

This article attempts to practice what I preach in Programmers Need To Learn Statistics Or I Will Kill Them All by doing my original Ruby/Odeum vs. Lucene analysis in a more formal way. There were several complaints from the Lucene crowd which I try to address, and I also bring up problems with suggested improvements from several other people. I’ll be using the R programming language to do the analysis and produce the graphs. Also, this essay won’t be as funny as my rant (if it was funny that is).

Problems With Part 1

People mentioned a few problems with my first performance analysis of Ruby/Odeum and Lucene. If you read that essay I did cover these as weak points:

  • The Lucene times are mostly caused by JVM start-up. This is very true, and intentional since my goal is to have a code search system for FastCST so command line start-up is a very big deal.
  • Not nearly enough sample runs for a clear statistical analysis. I just did a quick 5 samples and called it a win for Ruby/Odeum. This is really not the best way to do this analysis.

Executive Summary

For the people who have no clue (also known as “Executives”) here’s the information you need to tell all your employees they need to adopt the latest and greatest thing without ever having to understand anything you read. Cheaper than an article in CIO magazine and even has big words like “standard deviation”.

  • Systems tested were Lucene 1.4.3, Ruby/Odeum 0.3.1, QDBM Odeum 1.8.26 (from QDBM 1.8.26)
  • Ruby/Odeum is 1.75 times slower than Lucene once both are in a steady-state.
  • Lucene is 1.2 times slower than QDBM Odeum once both are in a steady-state.
  • Ruby/Odeum is 2.1 times slower than QDBM Odeum once both are in a stead-state.
  • Lucene’s ramp-up time is around 2 sets of 5500 queries (actually around 3 minutes is better).
  • Ruby/Odeum and QDBM Odeum ramp-up time is less than 1 set of 5500 queries (around 10 seconds).
  • A t-test shows that the sampled response times are all statistically significant in difference. This means that if we re-ran this study we’d most likely get similar timing differences.
  • Lucene uses anywhere from 3.6 to 36.76 times the memory that Ruby/Odeum does depending on the JVM used.
  • Ruby/Odeum uses 1.93 times the memory that QDBM Odeum does.

My Performance Analysis Process

I’m going to give my method for conducting an experiment to determine the performance of a system. It is based on your typical scientific method, but I’ve adapted it slightly to accommodate how a computer works. With a computer I have the advantage that I can pretty much sample as much as I want from any particular component. In a biological experiment or a manufacturing control sampling you’ll end up destroying the items you sample so you’re limited to random selection.

My process is pretty simple:

  1. Formulate a question you need to answer which is succinct, tests the smallest measurable thing possible, and is exact. This is typically the hardest part of the experiment.
  2. Develop a set of variables to measure in order to answer the question.
  3. Design the experiment based on the question and variables. Key problems to avoid when designing this experiment are:
    • Confounding—Only test one thing at a time, keep everything else the same or randomized.
    • Cross Contamination—Make sure that runs from one test do not harm the other tests.
    • Steady States—Make sure to drop initial measurements until you have the stead-state if you’re testing long running process performance.
    • Sampling Error—Data should perform randomly within given ranges. If you get wild swings, very similar results, sudden spikes, etc. then you’ve got an error in your run.
    • Measurement Error—This is where the run is right, but you measured it wrong. It looks similar to a sampling error, except that you probably have your math wrong some place. For example, if you round at the wrong time you’ll lose precision later on.
  4. Do a small run of the experiment to verify the design.
  5. Use the small run to determine a proper sample size.
  6. Run the full experiment.
  7. Perform the analysis on the results and write up proposed conclusions.

You’ll notice that I don’t mention developing a hypothesis in this process. The use of hypothesis testing in statistical analysis is actually a little controversial, with the majority of the scientific community using a hypothesis method, and folks in statistical process control avoiding them. The argument against using an hypothesis is that it will influence your experimental design and doesn’t really prove anything. The argument for using an hypothesis is that you need to have something to prove otherwise you can’t make a valid statement in your analysis.

I’m more practical than either of these view points. I don’t use a hypothesis if I’m simply comparing the performance of some systems or gathering data to find out what is going on. I do use a hypothesis if I’m trying to improve the performance of a system. I’ll simply say, “My hypothesis is that this change will be a statistically significant improvement in performance over the previous version.” I’ll also develop an hypothesis when someone else makes a claim that their system is better than another one, but this is again implied in the two data sets.

Research Plan

The big question I want to study is, “How do Ruby/Odeum, Odeum, and Lucene compare in querying performance and memory consumption?”

That’s pretty much the gist of it, but this isn’t really exact enough for our purpose. This question is much too vague to build up a decent experimental design. What we need to do is refine the question into very specific variables that we can measure and test in a consistent and clear way.

Refining The Question

There’s actually a whole process you can follow to refine a hypothesis or research question, and it’s remarkably similar to the process of doing an object-oriented domain decomposition. If you’re really interested read Methods of Social Research Bailey. It’s written for social science students, but it has a very good explanation of developing this kind of question. I’ll just cut to the chase and layout the refinement before people get even more bored.

What we really want to know is:

  1. The time in seconds (or fractions of seconds) to process the same word and access the URI/path for each returned document for each of Lucene 1.4.3, Ruby/Odeum 0.3.1, and QDBM Odeum 1.8.26.
  2. The 1 second VSZ+RSS memory samples during the entire run.

Including the direct QDBM Odeum (which is what Ruby/Odeum wraps) lets me see what is possible with Ruby/Odeum, and also lets me get a kind of baseline performance since QDBM Odeum is written in pure C. This leaves us with two variables we’re analyzing:

  1. “time”—The time to process the chosen word in seconds measured by reading the available time before the run, then the time after and subtracting.
  2. “memory”—The VSZ + RSS reported by the ps command at 1 second intervals while the test ran.

Finally, I have a few assumptions and constraints I’m going to follow:

  1. Samples will be done in sequential bursts with the index being closed and re-opened between each run.
  2. Ramp-up will be included in the initial measurement and not excluded in order to determine the effect of ramp-up on the results later.
  3. The chosen word will be “sprintf” as in the previous study in order to avoid “cheating”. An informal study has shown that response set size is a factor in response time, so it’s important that all queries use the exact same word.
  4. The three systems being tested are:
  5. The final analysis will be done without the ramp-up to get a clear picture of the steady state of each system.

Experimental Setup

My system configuration is the same as the previous informal study and only the test process was altered. The system is:

  • A Linux 2.6.11.7 with ReiserFS and 1G of RAM on a 2.4Ghz Celeron processor.
  • Java™ 2 Runtime Environment, Standard Edition (build 1.5.0_02-b09).
  • Test data used is the Linux 2.6.11.4 source code.
  • All unnecessary software was shutdown during the test and not disturbed.
  • Lucene 1.4.3
  • Ruby/Odeum 0.3.1
  • QDBM Odeum 1.8.26 which is actually the same Odeum backing Ruby/Odeum.
  • Single word being searched for is “sprintf”. Same as last informal study.

Initial Data Gathering

One of the biggest mistakes people make (and companies make on purpose) is to not verify that their sample sizes are good enough to support their accuracy requirements. Not taking enough measurements is like trying to measure microseconds with 16-bit integers. You could get lucky and have microseconds that fit in the chosen 16 bits, but you have no idea if this is true or not. Using an arbitrary sample size for your performance measurements is similar since you could choose way to many or way too few.

There is a chicken-and-eggs problem though since you need to have some data in order to make sure your data is valid. normally you would make an educated guess about the sample size or run a few simulations. My plan is to do a small set of 10 sample runs of 100 each, and then use that data to figure out an optimal set of sample runs and counts. I’ll take the samples, do a few useful graphs of each the sample means, and then use a power test to verify that I have a good sample size.

Statistical Power Test

The graphs and data for this initial run are available in a separate data directory for people to review. The main thing is that if I do a boxplot of the three time samples I can see that Lucene is going to end up being the one with the widest range.

What this graph shows is that Lucene has a slightly wider range (it’s box is fatter), so when I do my statistical power test the Lucene runs will require more than the other systems. To do my power test I’m using the power.t.test function in R. This function takes information about the kind of accuracy you want and the standard deviation you expect, and then tells you the number of samples you’ll need per sample run.

> lupower <- power.t.test(delta=0.001, sd=sd(lut$time), sig.level=0.01, power=0.9)
> odpower <- power.t.test(delta=0.001, sd=sd(odt$time), sig.level=0.01, power=0.9)
> ropower <- power.t.test(delta=0.001, sd=sd(rot$time), sig.level=0.01, power=0.9)
> c(lupower$n, odpower$n, ropower$n)
[1] 4081.330 1424.565 1706.061

Looks like Lucene’s run needs a lot more than the others, so we’ll end up having to use Lucene’s power requirements. An additional concern is that the data is not really normally distributed, which sucks because power.t.test is intended for that distribution. A general rule of thumb is that you can multiple the test results by 1.3 and cover yourself for any other distribution. I’ll just do this since computing is cheap and I can run the test over night. Our sample size will be 5306 samples, but let’s just say 5500 for good measure.

We’ve now got our sample sizes, but how many actual samples (runs) do we need? We can again use the power.t.test on the “sample means” to figure out how many we need. What’s a sample mean? The theory of sampling says that if we do a bunch of independent samples, and then take something like the mean or median, then this series of measurements will be normally distributed no matter how the actual data is distributed. In other words, if I just do my 5500 samples once then I have only one sample and one mean. If I repeat this 5500 say 10 more times, take the mean of each sample run, then I have 10 sample means. These sample means will be normally distributed as I take more of them (closer to 30).

I’m really only interested in the sample means in this analysis. The blue line is the mean, and the red line is two standard deviations from the mean.

A graph like this will let you easily see the ramp-up period prior to reaching the steady state of the system. It would not show you possible performance problems from assignable causes. For that you’d want to use this chart to find the steady state, then randomly pick one of the sample runs and do a complete run chart of all the samples. Anything that is outside the red lines (two standard deviations) is suspect and you should check it out.

If we run the power.t.test on this data then we can get an idea of how many sample runs we’ll have to do to get good accuracy:

> power.t.test(delta=0.001, sd=sd(lus), sig.level=0.01, power=0.9)

     Two-sample t test power calculation 

              n = 125.6942
          delta = 0.001
             sd = 0.002041483
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

 NOTE: n is number in *each* group 

Alright, that’s ridiculous, but that’s what the math seems to support. According to this we’ll need to run 69300 total timing tests to get even close to the accuracy we want and Ruby/Odeum being the slowest takes 0.032 seconds for each query. That means a quick calculation:

> mean(ros) * 126 * 5500 / 60 / 60
[1] 6.179075

This is the mean time of Ruby/Odeum, times the number of sample runs, time the number of samples, then divided by 60 / 60 to get the number of hours. And it’ll take 6.18 hours to run this test for just Ruby/Odeum. That’ll take forever probably pushing 12 hours with the other two runs included.

Back To Reality

What if we don’t have 12 hours to run this test? In our case we have a couple of choices:

  1. Reduce the delta parameter risking not detecting small differences in the results.
  2. Reduce the sig.level to 0.05 risking that we reject our hypothesis when it’s actually correct (called a Type I error).
  3. Reduce our power to 0.85 or 0.8 and risk accepting a hypothesis when it is actually false (Type II error).

If I have a choice, I prefer to reduce the size of the runs but try to do as many runs as I can. We’ll do this by accepting a different set of acceptable accuracy. I re-ran the power.t.test functions with delta=0.002 and sig.level=0.05 and came up with 22 sample runs of 721 each. This should take about half an hour to run.

I actually prefer to run the longer experiment in this case so that people from the Java camp don’t complain about JIT compilation and Hotspot usage and other Java start-up annoyances. As you’ll see later this is a good choice since it demonstrates the ramp-up period really well.

Second Stage Data Gathering

The second stage data collection was run using a set of shell scripts and then an analysis R script. The shell scripts just run each of the different programs in a consistent way. It’s harder to run your tests consistently if you don’t script them. The gist of the analysis is summarized in the following tables.

Data Minimum Median Mean Max SD
Lucene 0.01500 0.01700 0.01749 0.17900 0.004988786
Ruby/Odeum 0.02761 0.03091 0.03059 0.23910 0.001423409
QDBM Odeum 0.01352 0.01412 0.01456 0.05562 0.002730304

And some ratios comparing the mean and standard deviation (SD) for the different systems:

Comparison Mean Ratio SD Ratio
Ruby/Odeum vs. Lucene 1.749767 0.3699456
Ruby/Odeum vs. QDBM Odeum 2.100681 24.31456
Lucene vs. QDBM Odeum 1.200549 65.72469

These tables are actually very misleading for a reason I’ll cover in the Results section. Simply take them as the raw data results which we’ll refine later in the Results section.

Results

The first thing that I have to do is graph these three so I can see if there’s anything wrong with their data. If I do a simple comparison chart of the sample means you can see the first major problem:

Right away we have a ramp-up problem which is making the Lucene results look much worse than they actually are compared to the other two. Simply removing the first 2 runs would get rid of the ramp-up for all three and give us a better idea of the steady-state performance.

This also changes our previous two comparison tables, summary of statistical attributes:

Data Minimum Median Mean Max SD
Lucene 0.01730 0.01744 0.01745 0.01761 0.0000649
Ruby/Odeum 0.03038 0.03060 0.03060 0.03094 0.0001302
QDBM Odeum 0.01455 0.01456 0.01456 0.01457 0.00000486

And the ratios comparing the mean and standard deviation (SD) for the different systems:

Comparison Mean Ratio SD Ratio
Ruby/Odeum vs. Lucene 1.753899 2.006577
Ruby/Odeum vs. QDBM Odeum 2.10079 26.77634
Lucene vs. QDBM Odeum 1.197783 13.34429

That changes the standard deviation (SD) for Lucene in a big way, making it actually better than Ruby/Odeum. If we include the ramp-up period then it looks like Lucene is all over the map and has really poorly behaving performance.

What I’ve done here is slightly modified our question to say we only are interested in the data collected after any ramp-up period. Since our comparison graph shows that Lucene has a huge ramp-up of nearly 2 full sample runs (2 sets of 5500 queries each) we get a better result by dropping these as if they were outlier data.

Is this a fair thing to do though? Having a system that doesn’t start performing until it’s done some 11000 queries may seem a little ridiculous, but according to this data that’s about 3 minutes of constant pounding. The other two systems (Ruby/Odeum and Odeum) get up and running at a steady state right away, and what’s interesting is their performance gets slight worse after the first run (see how the Minimum is higher in the second set of tables?).

Overall I prefer to show both sets of data and let people decide for themselves. If you’re running Lucene in short bursts (such as in a command line tool) then this ramp-up period is probably the only performance you’ll ever see. If you run it as a server than you’ll see the steady-state performance after about 3 minutes of constant activity.

Memory Results

Memory is kind of strange on most systems since there’s not too clear of an idea of what is considered actually used memory vs. just what the operating system has assigned to that process. Many operating systems will keep recently freed memory around in order to hand it back to the process for more usage. Typically you’ll see the memory increase for a while and then steady out. Finally, memory for runs like this should stay fairly constant over time.

Our memory data shows pretty much this behavior, and gives a shocking memory result (which I’ll explore more to rule it out as a problem).

Data Minimum Median Mean Max SD
Lucene 277000 278100 278100 278200 105.8393
Ruby/Odeum 7336 7496 7495 7568 8.727567
QDBM Odeum 3620 3928 3927 3928 9.918917

This is the VSZ + RSS (in kbytes) sampled at 1 second intervals while each test ran. The SD for these isn’t really that big of a deal. It would matter if we saw a really huge SD for one of them which would imply a really bad memory usage pattern I’d want fixed before I did more comparisons.

I received a few comments from Erik Hatcher (author of Lucene in Action a fine book I’ve been inclined to buy) about the memory results for Lucene bringing up some issues. A few other people from Lucene also complained about the memory part (and so far really only the memory part). Even some folks going so far as to say “he still doesn’t have a clue.” Also people said that all I did was prove that the default memory settings for the chosen JVM were bad.

I decided to do a deeper analysis of the memory consumption that Lucene produces. Let’s just test out this assumption that these memory usage ratings are based on the default setting for -Xmx and -Xms. Hell, let’s even test out as many different JVMs as possible. I re-ran the tests with these changes:

  1. Removed the outer 126 sample run loop which causes the index to be opened and closed between sample runs to reach a default state. (This design does not change the performance measurements, I verified this earlier.)
  2. Changed the test so that it just does one small short 750 sample run rather than 5500. This was done just to save time as the memory starts to stabilize after around 500 or so.
  3. Downloaded as many Java Virtual Machines as I could get my hands on, and finally found four that could run the test:
    • Kaffe – Engine: Just-in-time v3 Version: 1.1.5 Java Version: 1.1
    • Sun JDK – Java™ 2 Runtime Environment, Standard Edition (build 1.5.0_02-b09) Java HotSpot™ Client VM (build 1.5.0_02-b09, mixed mode, sharing)
    • IBM JDK – Java™ 2 Runtime Environment, Standard Edition (build 1.4.2) Classic VM (build 1.4.2, J2RE 1.4.2 IBM build cxia32142sr1a-20050209 (JIT enabled: jitc))
    • BEA JRockit – Java™ 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04) BEA WebLogic JRockit™ 1.4.2_05 JVM R24.4.0-1 (build ari-38120-20041118-1131-linux-ia32, Native Threads, GC strategy: parallel)
  4. Redesigned the test harness to do a permutation of each -Xmx and -Xms option from the set of 1M,8M,16M. Settings higher than 32M didn’t improve the memory consumption (as expected).
  5. Re-ran the tests for each permutation of -Xmx and -Xms with each JVM and the newly recompiled Lucene jars.

Doing this produced the following table of memory consumptions by JVM.

JVM Minimum Median Mean Max SD
ibm 33110 56770 56720 80830 8192.050
jrockit 42500 47840 47860 57400 4001.254
kaffe 26570 33120 34910 51560 6863.516
sun 206000 218000 224000 236700 7882.041

This table doesn’t show the mixture of -Xmx and -Xms settings, but this wonderful coplot sure does:

You read the coplot by finding the two gray bars on the top and right side for -Xms and -Xmx, and then find where they intersect in the center. That is then the graph of JVM by total memory consumption for those two settings of -Xms and -Xmx. If a JVM is not listed on that panel then it didn’t have a measurement. This is because of the JVM aborting due to a setting it considered invalid. Some JVMs also reported an error for the minimum setting and then adjusted it to the maximum setting, but this didn’t seem to change the memory usage levels when I analyzed it.

In the end, after all this work, what we show is that most JVMs are much lower on the memory consumption scale the original Sun JDK. Sun’s JVM is just horrendous and a total pig, but even still the best memory consumption is still at 26570k. This absolute lowest possible memory consumption is still at least 3.6 times the Ruby/Odeum levels.

Another point I’d like to make (which seems to get lost on people who read this part of the study) is that I really don’t consider this Lucene’s problem. This is a classic JVM issue that’s been around forever. No matter how you slice it Java is a fat ugly ass memory hog.

This really demonstrates the validity of how I’m doing this experiment. With other performance analysis reports you’d have to kill someone to get at their data and source code. This makes it harder to prove alternative test cases or that the author has run the experiment fairly and accurately. With this experiment everything is open so folks like Erik who know way more about Lucene than I do are able to take the code and simply show me I’m wrong. If he can get the memory consumption down then he’s done and I can re-post the results. No need to argue, the data wins out every time.

It’s weird how 3 and 6 keep showing up in this analysis. I swear I’m not fixing the data.

Conclusions

The initial conclusions are simply summarized as:

  • Systems tested were Lucene 1.4.3, Ruby/Odeum 0.3.1, QDBM Odeum 1.8.26 (from QDBM 1.8.26)
  • Ruby/Odeum is 1.75 times slower than Lucene once both are in a steady-state.
  • Lucene is 1.2 times slower than QDBM Odeum once both are in a steady-state.
  • Ruby/Odeum is 2.1 times slower than QDBM Odeum once both are in a steady-state.
  • Lucene’s ramp-up time is around 2 sets of 5500 queries (actually around 3 minutes is better).
  • Ruby/Odeum and QDBM Odeum ramp-up time is less than 1 set of 5500 queries (around 10 seconds).
  • Lucene uses 3.6 to 36.76 times more memory that Ruby/Odeum does depending on the JVM you use.
  • Ruby/Odeum uses 1.93 times more memory that QDBM Odeum does.
  • A t-test shows that the sampled response times are all statistically significant in difference. This means that if we re-ran this study we’d most likely get similar timing differences.

This information is supported by the data so I won’t go into it much. I do have some additional observations which I’ll call my “opinions” rather than an inference. I prefer to let people take the data and scripts I’ve written and do their own analysis. I also encourage people to suggest improvements and expand on this analysis.

Opinions

Ruby/Odeum’s performance compared to the QDBM Odeum shows that the Ruby binding is cutting the performance in half. I can think of a couple of reasons why this could happen, but the primary one I plan to investigate is how the query function returns all of the results as a large Ruby array. This involves a lot of data copying and the each operator might not be the best performing way to do this. An alternative is to have a simple iterator setup that gets the documents directly, pushing the iteration of document IDs back to the C interface. I’ll actually be testing this in the next release to see if it improves performance enough to justify the change to the API.

Because Ruby/Odeum’s standard deviation (SD) after reaching steady-state is higher than the other systems, I predict that it will not perform as well under a heavier load. My experience says that a system with a high SD during testing will degrade much quicker when the load is increased. This is just a hunch, but I think a good test would be to make a green threads version of the Lucene system and then run Ruby/Odeum and Lucene tests again with a large number of threads. The reason I’d use green threads in Lucene is that’s the closest thing to Ruby’s threading system. The QDBM Odeum performance is so tight that I wouldn’t even bother with this test.

In general I’m very impressed with Lucene’s performance, and also very happy with Ruby/Odeum’s performance. Considering that QDBM Odeum is faster than Lucene there’s hope that Ruby/Odeum can become even more competitive with some additional work. Still, I feel like this is arguing over nothing since we’re talking tenths of a second differences to search 18174 documents for a single word.

My overall recommendation is that Ruby/Odeum seems to compete reasonably well with Lucene, and has advantages in memory footprint. I caution people to not take these results and assume that these systems will perform the same when integrated with their software. You should re-run these same tests with your own harness if you have concerns about the performance of these systems in new situations. Otherwise, you can probably go with this data and be reasonably secure that Ruby/Odeum or Lucene is fast enough for most of your search needs.

Available Code & Data

Everything I used is available for others to run, and all of the tools I used are free. You are free to take these programs and use them for your reference, copy them, send them around whatever. I donate them to the public domain unless someone else had a license first (like odmgr.c and SearchFiles.java).

You’ll notice this is different from most other “performance studies”. Other studies hide the data and results from you in order to sell you something. I want you to make an educated decision and to improve the analysis to suit your needs. You should start demanding this information from your vendors and study authors as without you can’t verify that they weren’t cheating.

  1. Source Code
  2. Initial 100 sample data
  3. Full sample data
  4. Range comparison chart with ramp-up
  5. Range comparison chart no ramp-up

Credits

I’d like to thank the following people for reviewing this article for accuracy, grammar, and spelling: