Friday, September 15, 2017

Fun: Places I hiked before (Qingkai's Hiking Map)

I love hiking, and before we had our kids, we usually went hiking or camping every weekend. Even with our daughter, we still try to go out every weekend, with her on my back. We once had an ambitious plan to hike all the parks in the Bay Area before I graduate, but now that goal seems hard to achieve!


In the following, I plot all the places I have hiked in the Bay Area (plus some in Northern CA) on a map, and many of these places we visited more than once! I am really impressed by how many places we have visited in the last few years, and I hope we will add more hiking dots to the map.

 

Friday, September 8, 2017

Tricks: How to search in Gmail

Many times we need to search for emails in Gmail, but instead of simply searching for a name or a keyword, there are many useful tricks in Gmail that can save you a lot of time. Here are some of the tricks I use all the time.

Using different operators

  • Finding emails with Obama in the subject line
subject:Obama
  • Finding unread emails from Obama that have an attachment
is:unread from:Obama has:attachment
  • Finding emails larger than 10 MB in size: 
size:10MB
  • Finding emails larger than 5 MB and smaller than 10 MB in size: 
larger_than:5m smaller_than:10M 
  • Finding emails larger than 5 MB and older than 1 year
larger:5m older_than:1y
  • Finding emails with a Word document as the attachment
filename:.doc
  • Finding all the emails in drafts that have an attachment
in:drafts has:attachment 
  • Finding emails between two dates
after:2011/08/22 before:2011/08/31 

Combining operators

By default, Gmail combines multiple operators with 'AND'; for example, the query above finds emails after 2011/08/22 AND before 2011/08/31. But there are more options:
  • Search exact phrase
"good day"
Note that if you search only for good day, you will get results from emails that contain both good and day, but they may not contain the phrase good day.
  • Search emails containing either Iron man OR Spider man
"Iron man" OR "Spider man"
  • Search emails that contain good but not day
good -day
  • Search emails that have dinner and movie in the subject line
subject:(dinner movie)
This will find emails that have both dinner and movie in the subject line, but not necessarily as a phrase.
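Of course, all of these operators can be mixed in a single query. As a made-up illustration, to find unread emails from Obama that contain good but not day and arrived within the last year, you could search:
is:unread from:Obama good -day newer_than:1y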

Monday, September 4, 2017

Article: Seismic Data from Smartphones - MyShake: Building a Global Smartphone Seismic Network

Our new article - Seismic Data from Smartphones - MyShake: Building a Global Smartphone Seismic Network - just came out in GeoStrata. It gives an overview of the MyShake system and the potential uses of MyShake in the engineering community. Have a look; you can download the article from here.



Sunday, August 27, 2017

Book Review: A Student's Guide to Waves

Recently, I read 'A Student's Guide to Waves' by Daniel Fleisch (he has written several other student's guides, all very good, check them out). It is such a pleasant read that I wish I had read it when I first studied waves. I recommend that anyone who wants to learn about waves, or has already learned about them, go through this book (you will find that you get through it very fast). It is truly a student's guide, and if I teach this subject in the future, I am sure this will be my class text ^)^

The very nice part of this book is that it explains everything in plain English. All the concepts and equations are explained like a story that you just want to follow with the author to understand more deeply. Besides, the book is only ~200 pages, and each section is short, which makes it a book that you can read anywhere (I actually read this book mostly on flights or on BART). The author has a very deep understanding of the subject, and he gives many nice explanations that I have never read in other books (I am a student in seismology and have read many books about mechanical waves, but most of the time I finished the book with a more confused view of waves; it took me a long time to understand them).

This book starts with the fundamentals of waves; concepts like the wavenumber, complex numbers, the Euler relations, wavefunctions, etc. are introduced here. These are the basics for learning more about waves. The author does a very nice job of showing how these concepts came about, and accompanied by the figures, they become very clear.

Afterwards, the book talks about the wave equation. How the wave equation is derived in a simple way, and why it involves 2nd partial derivatives, are all nicely explained here. Also, there are many details in the equations that we often ignore but that the author points out, which helps us understand the subject better.
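For reference, the classical one-dimensional wave equation (the standard form the book builds toward) relates the second partial derivative in time to the second partial derivative in space, with v the propagation speed:

\[ \frac{\partial^2 y}{\partial t^2} = v^2 \, \frac{\partial^2 y}{\partial x^2} \]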

Later, the book gives the general solutions to the wave equation and discusses the importance of the boundary conditions. After all this, Fourier synthesis and Fourier analysis are discussed with the aid of many figures, so that you will find that these important techniques are really simple, and they will stay in your mind forever. It even talks about the 'uncertainty principle' between the time/frequency domain and the distance/wavenumber domain that dominates many analyses in practice.

The last part of the book deals with specific types of waves, i.e. the mechanical wave equation, the electromagnetic wave equation and the quantum wave equation. Armed with the concepts and equations you learned before, you will see how to apply them to specific types of waves in the real world to address some interesting problems. Even though I am a seismologist, and mostly interested in mechanical waves, I found the electromagnetic and quantum wave equations also very interesting. I was impressed by the way all these natural phenomena link to the wave equation in its various forms.

Overall, it is a great short book that is suitable for beginners as well as more advanced researchers.






Saturday, July 29, 2017

Machine learning 17: Using scikit-learn Part 5 - Common practices

The material is based on my workshop at Berkeley - Machine learning with scikit-learn. I converted it here so that there is more explanation. Note that the code is written using Python 3.6. It is better to read my slides first, which you can find here. You can find the notebook on Qingkai's Github.
This week, we will discuss some common practices that we skipped in the previous weeks. These common practices will help us train a model that generalizes well, that is, performs well on the new data that we want to predict.
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('seaborn-poster')

%matplotlib inline

Classification Example

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
#get the dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split the dataset into a training and a testing set
# Test set will be the 25% taken randomly
X_train, X_test, y_train, y_test = train_test_split(X, y,
 test_size=0.25, random_state=33)
print(X_train.shape, y_train.shape)
(112, 4) (112,)
X_train[0]
array([ 5. ,  2.3,  3.3,  1. ])
Let's standardize the input features
# Standardize the features
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train[0]
array([-0.91090798, -1.59761476, -0.15438202, -0.14641523])
#Using svm
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y_train)

clf.score(X_test, y_test)
0.94736842105263153

Pipeline

We can use pipeline to chain all the operations into a simple pipeline:
from sklearn.pipeline import Pipeline

estimators = []
estimators.append(('standardize', preprocessing.StandardScaler()))
estimators.append(('svm', SVC()))
pipe = Pipeline(estimators)

pipe.fit(X_train, y_train)

pipe.score(X_test, y_test)
0.94736842105263153
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and the evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.

In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”: a model is trained using k-1 of the folds as training data; the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy). The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
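As a minimal sketch of what the basic k-fold procedure looks like under the hood (using scikit-learn's KFold iterator with the pipe and the iris X, y defined above; the number of folds and random_state here are just for illustration):
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=33)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # train on k-1 folds
    pipe.fit(X[train_idx], y[train_idx])
    # validate on the remaining fold
    fold_scores.append(pipe.score(X[val_idx], y[val_idx]))
print(np.mean(fold_scores))
This is essentially what the cross_val_score helper in the next section does for us in a single call.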

Computing cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=5)
scores 
array([ 0.96666667,  0.96666667,  0.96666667,  0.93333333,  1.        ])
The mean score and the standard deviation of the score estimate are hence given by:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
Accuracy: 0.97 (+/- 0.02)
It is also possible to use other cross-validation strategies by passing a cross-validation iterator instead, for instance:
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
cross_val_score(pipe, iris.data, iris.target, cv=cv)
array([ 0.97777778,  0.93333333,  0.95555556])
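Another iterator that is commonly used for classification is StratifiedKFold, which preserves the class proportions in each fold; a minimal sketch (the number of splits is just an example):
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
cross_val_score(pipe, iris.data, iris.target, cv=cv)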

Using cross-validation to choose parameters

For example, if we want to test different values of C for the SVM, we can run the following code and decide the best parameter. We can have a look at all the parameters we used in our pipeline by using the get_params function.
pipe.get_params()
{'standardize': StandardScaler(copy=True, with_mean=True, with_std=True),
 'standardize__copy': True,
 'standardize__with_mean': True,
 'standardize__with_std': True,
 'steps': [('standardize',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False))],
 'svm': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'svm__C': 1.0,
 'svm__cache_size': 200,
 'svm__class_weight': None,
 'svm__coef0': 0.0,
 'svm__decision_function_shape': None,
 'svm__degree': 3,
 'svm__gamma': 'auto',
 'svm__kernel': 'rbf',
 'svm__max_iter': -1,
 'svm__probability': False,
 'svm__random_state': None,
 'svm__shrinking': True,
 'svm__tol': 0.001,
 'svm__verbose': False}
C_s = np.linspace(0.001, 1000, 100)

scores = list()
scores_std = list()
for C in C_s:
    pipe.set_params(svm__C = C)
    this_scores = cross_val_score(pipe, X, y, n_jobs=1, cv = 5)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

# Do the plotting
plt.figure(1, figsize=(10, 8))
plt.clf()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0.82, 1.04)
plt.show()
[Figure: CV score vs. parameter C, with dashed lines showing one standard deviation]
Alternatively, we can use GridSearchCV to do the same thing:
from sklearn.model_selection import GridSearchCV

params = dict(svm__C=np.linspace(0.001, 1000, 100))

grid_search = GridSearchCV(estimator=pipe, param_grid=params,n_jobs=-1, cv=5)

grid_search.fit(X,y)
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm', SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'svm__C': array([  1.00000e-03,   1.01020e+01, ...,   9.89899e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
grid_search.best_score_ 
0.97333333333333338
grid_search.best_params_
{'svm__C': 10.102}
You can see all the results in grid_search.cv_results_
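Since cv_results_ is just a dictionary, a quick way to see what it contains (a sketch; the entries used in the exercise below are param_svm__C, mean_test_score and std_test_score) is:
sorted(grid_search.cv_results_.keys())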

Exercise

Using the grid_search.cv_results_ from the GridSearchCV, plot the same figure as above, showing parameter C vs. CV score.
# Do the plotting
plt.figure(1, figsize=(10, 8))
plt.clf()

C_s = grid_search.cv_results_['param_svm__C'].data
scores = grid_search.cv_results_['mean_test_score']
scores_std = grid_search.cv_results_['std_test_score']

plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0.82, 1.04)
plt.show()
[Figure: CV score vs. parameter C, reproduced from grid_search.cv_results_]