Sunday, October 22, 2017

Boston Visit

This week, I traveled to Boston to give talks at Harvard, MIT, and Boston University. All three universities are very nice and each has its own character. Here are some photos I took during the visit; I hope you can see the differences between them as well. 
Harvard
(photos)
Boston University
(photos)
MIT
(photos)
Boston Downtown

(photos)

Sunday, October 1, 2017

Some nice videos I saw recently

Recently, I watched some really nice talks; here are a few that gave me a lot of ideas:

Andrew Ng - AI is the new electricity! 

Key messages:
Supervised learning -> Transfer learning -> Unsupervised learning -> Reinforcement learning: this is the order of decreasing value generated in industry, and maybe the same in academia. Also, I agree that AI is the new electricity.

Ruchir Puri - Engineering the future of AI for business

Key messages:
How do we learn from small data instead of big data? That is, how do we generalize well from a small dataset? A baby sees only a few pictures of dogs and cats, but can then recognize most new images of different dogs and cats - there must be a way to do small-data learning well!

Feifei Li - ImageNet, where have we been? Where are we going?

Key messages:

“Datasets - not algorithms - might be the key limiting factor to development of human-level artificial intelligence” - Alexander Wissner-Gross. This presentation gives a very nice history of the development of ImageNet and how it started the deep learning revolution. It advocates the importance of data, and the idea behind it is very interesting!


Saturday, September 23, 2017

Estimate alert time from Earthquake Early Warning System

Concept of Earthquake Early Warning (EEW)

I have been working on Earthquake Early Warning for more than 7 years now, but I realize that I have never talked about earthquake early warning systems here in my blog. Therefore, today I will give a brief introduction to earthquake early warning systems and show you how to calculate the alert time.
The fundamental idea behind an earthquake early warning system is that electronic signals travel much faster than seismic waves. It is essentially light speed vs. sound speed: we usually see the lightning before we hear the thunder, and the time difference is the warning time.
(figure)
When an earthquake occurs, it usually generates two types of waves - P and S waves. The P wave (Primary or Pressure wave) travels at about 6 km/s and carries a relatively small amount of energy, while the S wave (Secondary or Shear wave) travels at about 3 km/s with a large amount of energy. The S wave usually causes most of the damage during an earthquake, so scientists thought: if we can detect the P wave quickly, we can issue an earthquake early warning to places the S wave hasn't reached yet, taking advantage of the light speed of the electronic signal. Let's see a very nice explanation of EEW from MSNBC's Rachel Maddow:
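To make the idea concrete, here is a minimal back-of-the-envelope sketch (not the code from my repo) that estimates the warning time at a site, assuming the 6 km/s P and 3 km/s S speeds mentioned above and a hypothetical alert delay:

```python
# Illustrative warning-time estimate (a sketch, not the production EEW code).
# Assumed values: S wave at 3 km/s (as above) and a hypothetical alert
# issued 5 s after the earthquake origin time.

def warning_time(dist_km, alert_delay_s, vs=3.0):
    """Seconds of warning at a site dist_km from the epicenter.

    The alert (electronic, effectively instantaneous) goes out
    alert_delay_s after the origin; the damaging S wave arrives at
    dist_km / vs. Negative values mean no warning (the blind zone).
    """
    ts = dist_km / vs          # S-wave arrival time after origin
    return ts - alert_delay_s  # time between alert and strong shaking

print(warning_time(60, 5.0))   # site 60 km away: 15.0 s of warning
print(warning_time(12, 5.0))   # site 12 km away: -1.0 s -> blind zone
```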

In the following, I will generate two figures to illustrate how to estimate the EEW alert time, using the M6.0 Napa earthquake in 2014 as an example. Note: this is just a demo of how to calculate the warning time from an earthquake early warning system; the numbers in the example are not exactly the real performance of the system during the earthquake. In this example, we can specify 3 different velocity models, which will give slightly different warning times. I will just show the figures; if you want to know how they are calculated, you can find all the code on Qingkai's Github.
# load all the functions
from warning_relationships import *
%matplotlib inline

Configuration

Change the location of the earthquake, the origin time, the time the alert was sent out, and your location. Then you can run the two functions below to plot the warning time on a figure or on a map. Shown here is the example of the 2014 M6.0 Napa earthquake in CA; I put the user location at Berkeley.
evla = 38.2155   #earthquake latitude
evlo = -122.3117  #earthquake longitude
evdp = 11.25  # earthquake depth in km
mag = 6.0  #earthquake magnitude

#event origin time
evt0 = UTCDateTime("2014-08-24T10:20:44.000")

#alert time from EEW (when the EEW system send out the alert)
alertT = UTCDateTime("2014-08-24T10:20:49.100")

#Your location
user_lat = 37.87
user_lon = -122.26
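As a quick sanity check on the configuration above, the alert latency (alert time minus origin time) can be computed with just the standard library (the notebook itself uses obspy's UTCDateTime, where subtracting two times also gives seconds):

```python
# Standard-library check of the alert latency in the configuration above
# (the notebook uses obspy's UTCDateTime for the same calculation).
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f"
evt0 = datetime.strptime("2014-08-24T10:20:44.000", fmt)    # origin time
alertT = datetime.strptime("2014-08-24T10:20:49.100", fmt)  # alert sent out

latency = (alertT - evt0).total_seconds()
print(latency)  # 5.1 -> the alert went out 5.1 s after the origin
```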

Plot the warning time

This shows the P and S phase arrival times from 3 different models - (1) iasp91, (2) ak135, (3) common - as well as the alert time. The common model simply assumes a P-wave velocity of 6.1 km/s and an S-wave velocity of 3.55 km/s. You can disable the P travel time curve, S travel time curve, or alert time line by setting them to False, and you can plot without the city labels by setting 'show_city' to False. If you choose to plot the cities, you need to specify a dictionary with city names as the keys; each value is a tuple of (latitude, longitude, text offset from the S-wave arrival in the x direction, text offset in the y direction). For example:
'BSL':(37.87, -122.26, -30, 1.1)
'Name':(lat,   lon,     dx,  dy)
dx and dy control where the city label text and the corresponding arrow are plotted.
cities = {
          'Napa Valley Opera House':(38.29950, -122.28568, 20, -3.1), 
          'Your Location':(user_lat, user_lon, -30, 1.1), 
          'SF City Hall':(37.7792, -122.4191, -45, 1.1),
          'Central San Jose':(37.3333, -121.9000, -50, 1.1),
          }
plot_P_and_S(evla, evlo, evdp, evt0, alertT, cities, max_dist = 120, max_T = 50, show_city=True, showP = True, showAlertT = True, model = 'common',show_title = True)
(figure)
Figure Description: This figure shows the P and S waves from the M6.0 Napa earthquake in 2014. Distance from the epicenter is on the x-axis and time since the origin of the earthquake is on the y-axis. P and S waves are shown as blue and red lines. A dotted line shows the time the earthquake alert was sent out. The cities specified by the user are plotted on the figure as well, with a thick black line indicating the time from the alert to the S-wave arrival - this is the warning time. In this case, the Napa Valley Opera House didn't get a warning, because it is close to the epicenter and the warning was sent out after the S wave had already arrived there.

Plot warning time on map

This plots the warning time as circles on the map. We can also plot some cities by setting show_cities to True. Besides, we can show the blind zone (places that get no alert before the S wave arrives) as a red circle. We can calculate the alert time based on 3 different models - (1) iasp91, (2) ak135, (3) common. The common model simply assumes a P-wave velocity of 6.1 km/s and an S-wave velocity of 3.55 km/s. Cities are stored as a dictionary; see the section above for the explanation.
Note: If you want to plot the map quickly, set show_counties = False and resolution = 'l'.
cities = {
          'Napa Valley Opera House':(38.29950, -122.28568, 0.01, 0.03), 
          'BSL':(37.87, -122.26, 0.01, 0.03), 
          'SF City Hall':(37.7792, -122.4191, 0.01, -0.06),
          'Central San Jose':(37.3333, -121.9000, 0.01, 0.03),
          }

plot_warningTime_on_map(evla, evlo, evdp, evt0, alertT, map_range = 1.2, cities = cities, show_cities = True, figsize = (10,10), resolution = 'h', pretty = False, show_distance_label = True, show_counties = True, show_blind_zone = True, show_legend= True, model = 'common',show_title = True)

(figure)
Figure Description: This figure shows the alert time from the M6.0 Napa earthquake in 2014 on a map. The red star is the location of the earthquake. Blue dots are the cities specified by the user. The red circle is the blind zone: inside this zone there will be no alert before the S wave arrives. The 5 s, 10 s, 15 s, and 20 s warning zones are plotted on the map as green circles.

Friday, September 15, 2017

Fun: Places I hiked before (Qingkai's Hiking Map)

I love hiking, and before we had our kids, we used to go hiking or camping almost every weekend. Even with our daughter, we still try to go out every weekend, with her on my back. We once had an ambitious plan to hike all the parks in the Bay Area before I graduate, but now that goal seems hard to achieve!


In the following, I plot all the places I have hiked in the Bay Area (plus some in Northern CA) on a map - many of them we visited more than once! I am really impressed by how many places we visited in the last few years, and I hope we will keep adding hiking dots to the map.

 

Friday, September 8, 2017

Tricks: How to search in Gmail

Many times we need to search for emails in Gmail, and beyond simply searching for a name or keyword, there are many useful tricks in Gmail that can save you a lot of time. Here are some of the tricks I use all the time. 

Using different operators

  • Finding emails with Obama in subject line
subject:Obama
  • Finding unread emails from Obama that have an attachment
is:unread from:Obama has:attachment
  • Finding emails larger than 10 MB in size: 
larger:10M
  • Finding emails larger than 5 MB and smaller than 10 MB in size: 
larger_than:5m smaller_than:10M 
  • Finding emails larger than 5 MB and older than 1 year
larger:5m older_than:1y
  • Finding emails with word document as the attachment
filename:.doc
  • Finding all the emails with attachment in drafts
in:drafts has:attachment 
  • Finding emails between two dates
after:2011/08/22 before:2011/08/31 

Combining operators

By default, Gmail combines multiple operators with 'AND'; for example, the query above finds emails after 2011/08/22 AND before 2011/08/31. But there are more options:
  • Search exact phrase
"good day"
Note that if you search just for good day, you will get results from emails that contain both good and day, but not necessarily the phrase good day. 
  • Search emails containing either Iron man OR Spider man
"Iron man" OR "Spider man"
  • Search emails containing good but not day
good -day
  • Search emails have dinner and movie in the subject line
subject:(dinner movie)
This finds emails that have both dinner and movie in the subject line, though not necessarily as a phrase.

Monday, September 4, 2017

Article: Seismic Data from Smartphones - MyShake: Building a Global Smartphone Seismic Network

Our new article - Seismic Data from Smartphones - MyShake: Building a Global Smartphone Seismic Network - just came out in GeoStrata. It gives an overview of the MyShake system and the potential use of MyShake in the engineering community. Have a look - you can download the article from here



Sunday, August 27, 2017

Book Review: A Student's Guide to Waves

Recently, I read 'A Student's Guide to Waves' by Daniel Fleisch (he has several other student's guides, all very good - check them out). It was such a pleasant read that I wish I had read it when I first studied waves. I recommend that anyone who wants to learn about waves, or who has already learned, go through this book (you will find you get through it very fast). It is truly a student's guide, and if I teach this subject in the future, I am sure this will be my class text ^)^

The very nice part of this book is that it explains everything in plain English. All the concepts and equations are explained like a story that you just want to follow with the author to understand more deeply. Besides, the book is only ~200 pages, and each section is short, which makes it a book you can read anywhere (I actually read it mostly on flights or on BART). The author has such a deep understanding of the subject that he gives many nice explanations I never read in other books (as a student in seismology, I have read many books about mechanical waves, but most of the time I finished them with a more confused view of waves; it took me a long time to understand).

The book starts with the fundamentals of waves; concepts like the wavenumber, complex numbers, Euler relations, wavefunctions, etc. are introduced here. These are the basics for learning more about waves. The author does a very nice job of showing how these concepts came about, and accompanied by the figures, they become very clear.

Afterwards, the book talks about the wave equation: how the wave equation is derived in a simple way, and why it involves 2nd partial derivatives, are all nicely explained here. There are also many details in the equations that we often ignore but that the author points out, which helps us understand the subject better.

Later, the book gives the general solutions to the wave equation and discusses the importance of boundary conditions. After all this, Fourier synthesis and Fourier analysis are discussed with the aid of many figures, and you will find that these important ideas are really simple and will stay in your mind forever. It even talks about the 'uncertainty principle' between the time/frequency domain and the distance/wavenumber domain that dominates many analyses in practice.

The last part of the book deals with specific types of waves, i.e. the mechanical wave equation, the electromagnetic wave equation, and the quantum wave equation. Armed with the concepts and equations you learned before, you will see how to apply them to specific types of waves in the real world to address some interesting problems. Even though I am a seismologist mostly interested in mechanical waves, I found the electromagnetic and quantum wave equations very interesting too. I was impressed by how many natural phenomena link to the wave equation in its various forms.

Overall, it is a great short book, suitable for beginners as well as more advanced researchers.






Saturday, July 29, 2017

Machine learning 17: Using scikit-learn Part 5 - Common practices

The material is based on my workshop at Berkeley - Machine learning with scikit-learn. I converted it here so that there is more explanation. Note that the code is written in Python 3.6. It is better to read my slides first, which you can find here. You can find the notebook on Qingkai's Github. 
This week, we will discuss some common practices that we skipped in the previous weeks. These common practices will help us train a model that generalizes well, that is, performs well on the new data we want to predict. 
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('seaborn-poster')

%matplotlib inline

Classification Example

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
#get the dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split the dataset into a training and a testing set
# Test set will be the 25% taken randomly
X_train, X_test, y_train, y_test = train_test_split(X, y,
 test_size=0.25, random_state=33)
print(X_train.shape, y_train.shape)
(112, 4) (112,)
X_train[0]
array([ 5. ,  2.3,  3.3,  1. ])
Let's standardize the input features
# Standardize the features
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train[0]
array([-0.91090798, -1.59761476, -0.15438202, -0.14641523])
#Using svm
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y_train)

clf.score(X_test, y_test)
0.94736842105263153

Pipeline

We can use pipeline to chain all the operations into a simple pipeline:
from sklearn.pipeline import Pipeline

estimators = []
estimators.append(('standardize', preprocessing.StandardScaler()))
estimators.append(('svm', SVC()))
pipe = Pipeline(estimators)

pipe.fit(X_train, y_train)

pipe.score(X_test, y_test)
0.94736842105263153
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set, because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and the evaluation metrics no longer report on generalization performance.

To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set. However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches exist, but generally follow the same principles). For each of the k “folds”: a model is trained using k-1 of the folds as training data, and the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy). The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
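The k-fold procedure described above can also be written out by hand, which makes explicit what the helper functions below do internally. Here is a sketch on the same iris data and pipeline (note that for classification, scikit-learn's helpers actually use stratified folds by default, so the exact numbers will differ slightly):

```python
# Hand-rolled 5-fold CV: train on 4 folds, score on the held-out fold,
# then average the fold scores. A sketch to illustrate the procedure.
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

pipe = Pipeline([('standardize', preprocessing.StandardScaler()),
                 ('svm', SVC())])

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    pipe.fit(X[train_idx], y[train_idx])                      # train on k-1 folds
    fold_scores.append(pipe.score(X[test_idx], y[test_idx]))  # score the held-out fold

print(np.mean(fold_scores))  # the average is what k-fold CV reports
```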

Computing cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=5)
scores 
array([ 0.96666667,  0.96666667,  0.96666667,  0.93333333,  1.        ])
The mean score and the standard deviation of the score estimate are hence given by:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
Accuracy: 0.97 (+/- 0.02)
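Note that the ± value printed here is one standard deviation of the fold scores. For an approximate 95% interval, a common convention (assuming roughly normal fold scores) is mean ± 2 × std:

```python
# Approximate 95% interval for the CV accuracy: mean +/- 2 * std,
# using the five fold scores obtained above.
import numpy as np

scores = np.array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.0])
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# -> Accuracy: 0.97 (+/- 0.04)
```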
It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
cross_val_score(pipe, iris.data, iris.target, cv=cv)
array([ 0.97777778,  0.93333333,  0.95555556])

Using cross-validation to choose parameters

For example, if we want to test different values of C for the SVM, we can run the following code and decide the best parameter. We can have a look at all the parameters used in our pipeline by using the get_params function. 
pipe.get_params()
{'standardize': StandardScaler(copy=True, with_mean=True, with_std=True),
 'standardize__copy': True,
 'standardize__with_mean': True,
 'standardize__with_std': True,
 'steps': [('standardize',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False))],
 'svm': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'svm__C': 1.0,
 'svm__cache_size': 200,
 'svm__class_weight': None,
 'svm__coef0': 0.0,
 'svm__decision_function_shape': None,
 'svm__degree': 3,
 'svm__gamma': 'auto',
 'svm__kernel': 'rbf',
 'svm__max_iter': -1,
 'svm__probability': False,
 'svm__random_state': None,
 'svm__shrinking': True,
 'svm__tol': 0.001,
 'svm__verbose': False}
C_s = np.linspace(0.001, 1000, 100)

scores = list()
scores_std = list()
for C in C_s:
    pipe.set_params(svm__C = C)
    this_scores = cross_val_score(pipe, X, y, n_jobs=1, cv = 5)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

# Do the plotting
plt.figure(1, figsize=(10, 8))
plt.clf()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0.82, 1.04)
plt.show()
(figure)
Alternatively, we can use the GridSearchCV to do the same thing:
from sklearn.model_selection import GridSearchCV

params = dict(svm__C=np.linspace(0.001, 1000, 100))

grid_search = GridSearchCV(estimator=pipe, param_grid=params,n_jobs=-1, cv=5)

grid_search.fit(X,y)
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm', SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'svm__C': array([  1.00000e-03,   1.01020e+01, ...,   9.89899e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
grid_search.best_score_ 
0.97333333333333338
grid_search.best_params_
{'svm__C': 10.102}
You can see all the results in grid_search.cv_results_
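Since refit=True by default, GridSearchCV also retrains the model with the best parameters on the full dataset, available as best_estimator_. A small self-contained sketch (using a shorter grid of C values than the 100-value sweep above):

```python
# After a grid search, best_estimator_ is the pipeline refit on all the
# data with the best parameters (refit=True by default). Sketch with a
# small illustrative grid of C values.
from sklearn import datasets, preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

pipe = Pipeline([('standardize', preprocessing.StandardScaler()),
                 ('svm', SVC())])

grid = GridSearchCV(estimator=pipe,
                    param_grid={'svm__C': [0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)

print(grid.best_params_)                    # the C chosen by CV
print(grid.best_estimator_.predict(X[:3]))  # predictions from the tuned pipeline
```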

Exercise

Using grid_search.cv_results_ from the GridSearchCV, plot the same figure as above, showing parameter C vs. CV score. 
# Do the plotting
plt.figure(1, figsize=(10, 8))
plt.clf()

C_s = grid_search.cv_results_['param_svm__C'].data
scores = grid_search.cv_results_['mean_test_score']
scores_std = grid_search.cv_results_['std_test_score']

plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0.82, 1.04)
plt.show()
(figure)