Friday, May 22, 2015

Setup Hadoop using Sandbox

I am now learning Hadoop, and I find the easiest way to set up the environment is to use the Hortonworks Sandbox (I also tried installing everything using Homebrew, and it took me much longer to get everything working). I will just list the steps to set up the environment here, and hope they are useful to you as well.

Step 1 - install Sandbox and start it:
Download the Sandbox from here; I downloaded the one for VirtualBox (Mac & Windows). It essentially sets up a Linux server (RedHat) in a virtual machine with all the needed packages installed, e.g. Hadoop, Hive, Pig, etc. Follow the instructions in the install guide PDF to start the virtual machine.

Step 2 - Log in to the Linux server:
You can either use the provided GUI or use SSH. I prefer working in the terminal, so I use SSH.
$ssh root@127.0.0.1 -p 2222 #passwd: hadoop
If you want to put/get files from your local machine, use sftp:
$sftp -P 2222 root@127.0.0.1 #Note here the -P flag is upper case
To make things easier, add aliases to your .bashrc file; then you only need to type 'hssh' to log in to the server.
alias hssh='ssh root@127.0.0.1 -p 2222'
alias hsftp='sftp -P 2222 root@127.0.0.1'

Step 3 - SSH login without a password
If you don't like typing the password every time you log in, you can do the following:
(1) Check whether you already have a .ssh folder (with a key pair such as id_rsa and id_rsa.pub) in your home directory:
$ls -a ~/.ssh
(2) If you don't have the folder or a key pair, generate one using:
$ssh-keygen
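(3) Then copy the generated public key to the server. Note that the following command overwrites any existing authorized_keys file on the server and assumes the ~/.ssh folder already exists there:
$scp -P 2222 ~/.ssh/id_rsa.pub root@127.0.0.1:~/.ssh/authorized_keys
If the server already has other keys in authorized_keys, append your key instead:
$cat ~/.ssh/id_rsa.pub | ssh -p 2222 root@127.0.0.1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'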

OK, now you are all set, and you can follow the Hortonworks tutorials.

Wednesday, May 13, 2015

Algorithms for the Ages

Top 10 algorithms in the 20th century

“Great algorithms are the poetry of computation,” says Francis Sullivan of the Institute for Defense Analyses' Center for Computing Sciences in Bowie, Maryland. He and Jack Dongarra of the University of Tennessee and Oak Ridge National Laboratory have put together a sampling that might have made Robert Frost beam with pride—had the poet been a computer jock. Their list of 10 algorithms having “the greatest influence on the development and practice of science and engineering in the 20th century” appears in the January/February issue of Computing in Science & Engineering. If you use a computer, some of these algorithms are no doubt crunching your data as you read this. The drum roll, please:

1946: The Metropolis Algorithm for Monte Carlo. Through the use of random processes, this algorithm offers an efficient way to stumble toward answers to problems that are too complicated to solve exactly.

1947: Simplex Method for Linear Programming. An elegant solution to a common problem in planning and decision-making.

1950: Krylov Subspace Iteration Method. A technique for rapidly solving the linear equations that abound in scientific computation.

1951: The Decompositional Approach to Matrix Computations. A suite of techniques for numerical linear algebra.

1957: The Fortran Optimizing Compiler. Turns high-level code into efficient computer-readable code.

1959: QR Algorithm for Computing Eigenvalues. Another crucial matrix operation made swift and practical.

1962: Quicksort Algorithms for Sorting. For the efficient handling of large databases.

1965: Fast Fourier Transform. Perhaps the most ubiquitous algorithm in use today, it breaks down waveforms (like sound) into periodic components.

1977: Integer Relation Detection. A fast method for spotting simple equations satisfied by collections of seemingly unrelated numbers.

1987: Fast Multipole Method. A breakthrough in dealing with the complexity of n-body calculations, applied in problems ranging from celestial mechanics to protein folding.

From: Random Samples, Science, page 799, February 4, 2000.

Tuesday, May 12, 2015

Empirical Subspace Detection

After walking through the steps of the subspace detection algorithm in the previous blog post, I read a paper on the empirical subspace detection algorithm, which can be found here.

The idea of the empirical subspace detection algorithm is this: after you have the design set (a matrix of repeating waveforms aligned on the P wave), you don't need to calculate the SVD to find the orthogonal basis as in the original subspace detection algorithm. Instead, you calculate the stack of the waveforms in the design set and the time derivative of the stacked waveform. The author of the paper found that the stacked waveform is similar to the first basis vector from the SVD, and the derivative of the stacked waveform is similar to the second basis vector. The first one is quite easy to understand: the first basis vector contains the common features of all the waveforms, which is why it mimics the stack of the waveforms. But why the second basis vector is similar to the derivative of the stacked signal is not that easy to see. The author claims that the second singular vector represents information related to the variations produced by slight offsets in the locations of the design set earthquakes.

So this forms the empirical part of the algorithm: just use the stacked waveform of the design set and its time derivative to represent the first two basis vectors from the SVD (no need to calculate the SVD in practice).
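
One heuristic way to see why a time derivative could show up is a first-order Taylor expansion. This is only a mathematical sketch, not a physical explanation, and it assumes that each design-set waveform x_k(t) is the same waveform s(t) delayed by a small time shift tau_k (the symbols s, x_k and tau_k are introduced here for illustration and are not from the paper):

```latex
x_k(t) = s(t - \tau_k) \approx s(t) - \tau_k \, s'(t),
\qquad
\int_{-\infty}^{\infty} s(t)\, s'(t)\, dt
  = \tfrac{1}{2} \left[ s(t)^2 \right]_{-\infty}^{\infty} = 0 .
```

Under this assumption every waveform lies approximately in the plane spanned by s(t) and s'(t), and these two directions are orthogonal for a finite-energy pulse, which is at least consistent with the stack and its derivative resembling two orthogonal SVD basis vectors.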

I went ahead and tested this concept using some simulated signals. This is what I did:
(1) I generated a signal using Mexican hat wavelets: one pulse at one location to represent the P wave, and another pulse at a later location with twice the amplitude to represent the S wave. Then I added white noise in the background.
(2) I generated another 9 waveforms based on the above signal, each shifting the S wave by a small step (note: the first pulses are aligned to represent the P wave, and the second pulse is shifted to represent the S wave shift due to the different locations of the events). See the following figure as an example:

(3) I calculated the stack of these signals, and the time derivative of the stacked signal.
(4) Then I compared the signals from step (3) with the first and second basis vectors from the SVD.
(5) I plotted the normalized comparison (scaling the maximum amplitude to 1) in the following figures:

Conclusion: It does seem the author is correct that the stacked waveform and the time derivative of the stacked waveform resemble the first two basis vectors from the SVD. For the derivative of the stack, the S wave part matches the second basis vector very well (similar to Fig. 4 in the author's paper). I also tried different pulse widths; you can find them in the code or the figures on my Github. Still, I cannot give a solid physical explanation of why the time derivative of the stack matches the 2nd basis vector. If you know the answer, please let me know ^)^

You can find my code to generate these figures here on my Github. The code also generates results for different widths of the Mexican hat wavelet, so you can play with different frequencies. Try it.
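
For a quick feel of the experiment without the repository, steps (1) to (5) can be sketched in a few lines of NumPy. This is a minimal sketch and not the code from the Github repository: the function names, pulse positions, pulse width, shift step, noise level and random seed are all assumed values chosen for illustration, and the comparison uses the absolute normalized correlation instead of plots.

```python
import numpy as np


def ricker(t, t0, width):
    """Mexican hat (Ricker) wavelet centred at t0 (unnormalized)."""
    x = (t - t0) / width
    return (1.0 - x**2) * np.exp(-0.5 * x**2)


def build_design_set(n_events=10, n_samples=1000, p_pos=300.0, s_pos=600.0,
                     width=20.0, shift_step=2.0, noise_std=0.02, seed=0):
    """Rows are synthetic waveforms: aligned P pulse, shifted S pulse, noise.

    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples, dtype=float)
    rows = []
    for k in range(n_events):
        # P pulse stays aligned; S pulse (twice the amplitude) moves k steps later
        sig = ricker(t, p_pos, width) + 2.0 * ricker(t, s_pos + k * shift_step, width)
        sig = sig + noise_std * rng.standard_normal(n_samples)
        rows.append(sig)
    return t, np.vstack(rows)


def empirical_basis(design):
    """Stack of the design set and the time derivative of the stack."""
    stack = design.mean(axis=0)
    deriv = np.gradient(stack)  # finite-difference derivative, unit sample spacing
    return stack, deriv


def svd_basis(design, n=2):
    """First n right singular vectors (the time-domain basis vectors)."""
    _, _, vt = np.linalg.svd(design, full_matrices=False)
    return vt[:n]


def abs_corr(a, b):
    """Absolute normalized correlation; the sign of an SVD vector is arbitrary."""
    return abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))


if __name__ == "__main__":
    t, design = build_design_set()
    stack, deriv = empirical_basis(design)
    v1, v2 = svd_basis(design)
    s_window = slice(500, 720)  # samples around the S pulse only
    print("stack vs 1st basis vector:", abs_corr(stack, v1))
    print("d(stack)/dt vs 2nd basis vector (S window):",
          abs_corr(deriv[s_window], v2[s_window]))
```

The second comparison is restricted to a window around the S pulse because the P pulses are aligned: the derivative of the stack still contains the derivative of the P pulse, while the second basis vector mostly describes the part of the signal that varies between events.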

Acknowledgment: Thanks to Taka for the discussion, which was really helpful!