Friday, December 29, 2017

Use K-D Tree to query points - Part 2 (Use geo-coordinates and real distances)

In the previous post, we talked about using a KD-tree to find the points closest to a reference point in a group. But we noticed that it uses the Euclidean distance. In reality, our points are on the earth, and we usually want the real distance between two points on the earth (at least for me, this is usually the case). Therefore, we need to make the KD-tree work with the real distances we use in everyday life. In this post, I will show you how to do it. You can find all the code on Qingkai’s Github.
from scipy import spatial
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
%matplotlib inline

plt.style.use('seaborn-poster')
Let’s first generate a grid of data points in the Bay Area with a latitude/longitude step of 0.1 degree. We choose the reference point at the center of the map. See the following figure:
# define map ranges
llat = 37.0
ulat = 38.5
llon = -123
ulon = -121.5

# generate grid points
lats = np.arange(llat, ulat, 0.1)
lons = np.arange(llon, ulon, 0.1)
lons, lats = np.meshgrid(lons, lats)

# flatten the locations
lons_1d = lons.flatten()
lats_1d = lats.flatten()

# reference point as the center of the map
lat_0 = (llat + ulat) / 2.
lon_0 = (llon + ulon) / 2.
plt.figure(figsize=(10,10))
m = Basemap(projection='merc', lon_0=-125.36929, lat_0=38.3215, 
        llcrnrlon=llon,llcrnrlat=llat- 0.01,urcrnrlon=ulon,urcrnrlat=ulat + 0.01,resolution='h')
m.drawcoastlines()
m.drawmapboundary()

m.drawparallels(np.arange(llat, ulat + 0.01, (ulat - llat)/2.), labels=[1,0,0,0], linewidth=0.1, fmt='%.1f')
m.drawmeridians(np.arange(llon, ulon + 0.01, (ulon - llon)/2.), labels=[0,0,0,1], linewidth=0.1, fmt='%.1f') 

x_0, y_0 = m(lons_1d, lats_1d)
m.plot(x_0, y_0, 'bo', markersize=10) 

x_0, y_0 = m(lon_0, lat_0)
m.plot(x_0, y_0, 'r*', markersize=30) 

m.fillcontinents()
m.drawmapscale(llon + 0.2, llat + 0.1, lon_0, lat_0, 30, 'fancy')

plt.show()
[Figure: Bay Area map showing the grid points (blue dots) and the reference point (red star)]

Find the points within 30 km from the reference point

Now, let’s find the points within 30 km of the reference point using the KD-tree. I got the idea from this discussion on stackoverflow. The basic idea is to convert the latitude and longitude of the points to 3D cartesian coordinates and do the KD-tree query in these cartesian coordinates. We first define the functions to convert the points between geo-coordinates and cartesian coordinates.
from math import *

def to_Cartesian(lat, lng):
    '''
    function to convert latitude and longitude (in degrees) to 3D cartesian coordinates
    '''
    R = 6371 # radius of the Earth in kilometers

    # the trigonometric functions expect radians, so convert first
    lat, lng = radians(lat), radians(lng)

    x = R * cos(lat) * cos(lng)
    y = R * cos(lat) * sin(lng)
    z = R * sin(lat)
    return x, y, z

def deg2rad(degree):
    '''
    function to convert degree to radian
    '''
    rad = degree * 2*np.pi / 360
    return(rad)

def rad2deg(rad):
    '''
    function to convert radian to degree
    '''
    degree = rad/2/np.pi * 360
    return(degree)

def distToKM(x):
    '''
    function to convert the cartesian (chord) distance to the real distance in km
    '''
    R = 6371 # earth radius in km
    gamma = 2*np.arcsin(x/(2.*R)) # central angle subtended by the chord
    dist = R*gamma # arc length along the great circle
    return(dist)

def kmToDIST(x):
    '''
    function to convert the real distance in km to the cartesian (chord) distance 
    '''
    R = 6371 # earth radius in km
    gamma = x/R # central angle corresponding to the arc length
    
    dist = 2*R*np.sin(gamma/2.) # chord length
    return(dist)
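As a quick sanity check, we can compare the chord-based conversion with the haversine great-circle formula for two nearby grid points; here is a minimal sketch, assuming the conversion functions defined above:
# sanity check: chord-based conversion vs. the haversine great-circle formula
def haversine_km(lat1, lon1, lat2, lon2):
    R = 6371 # earth radius in km
    lat1, lon1, lat2, lon2 = map(deg2rad, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1)/2.)**2 + np.cos(lat1)*np.cos(lat2)*np.sin((lon2 - lon1)/2.)**2
    return 2*R*np.arcsin(np.sqrt(a))

p1, p2 = (37.7, -122.3), (37.75, -122.25) # two nearby grid points
chord = np.linalg.norm(np.array(to_Cartesian(*p1)) - np.array(to_Cartesian(*p2)))
print(distToKM(chord), haversine_km(p1[0], p1[1], p2[0], p2[1])) # the two numbers should be nearly identical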
# convert the grid points and reference points to cartesian coordinates
x, y, z = zip(*map(to_Cartesian, lats_1d, lons_1d))
x_ref, y_ref, z_ref = to_Cartesian(lat_0, lon_0)
# convert the 30 km to cartesian coordinates distance
dist = kmToDIST(30)

# create the KD-tree using the 3D cartesian coordinates
coordinates = list(zip(x, y, z))
tree = spatial.cKDTree(coordinates)

# get all the points within 30 km from the reference point
ix = tree.query_ball_point((x_ref, y_ref, z_ref), dist)

# plot them on the map
plt.figure(figsize=(10,10))
m = Basemap(projection='merc', lon_0=-125.36929, lat_0=38.3215, 
        llcrnrlon=llon,llcrnrlat=llat- 0.01,urcrnrlon=ulon,urcrnrlat=ulat + 0.01,resolution='h')
m.drawcoastlines()
m.drawmapboundary()

m.drawparallels(np.arange(llat, ulat + 0.01, (ulat - llat)/2.), labels=[1,0,0,0], linewidth=0.1, fmt='%.1f')
m.drawmeridians(np.arange(llon, ulon + 0.01, (ulon - llon)/2.), labels=[0,0,0,1], linewidth=0.1, fmt='%.1f') 

x_0, y_0 = m(lons_1d, lats_1d)
m.plot(x_0, y_0, 'bo', markersize=10, label = 'Grid point') 

x_0, y_0 = m(lon_0, lat_0)
m.plot(x_0, y_0, 'r*', markersize=30, label = 'Ref. point') 

x_0, y_0 = m(lons_1d[ix], lats_1d[ix])
m.plot(x_0, y_0, 'ro', markersize=10, label = 'Point within 30 km') 

m.fillcontinents()
m.drawmapscale(llon + 0.2, llat + 0.1, lon_0, lat_0, 30, 'fancy')

plt.legend(numpoints = 1)

plt.show()
[Figure: Bay Area map with the grid points, the reference point, and the points within 30 km highlighted in red]

Find the distances for the closest 10 points

We can also find the distances to the closest 10 points. This is very simple using the query function.
# get the cartesian distances from the 10 closest points
dist, ix = tree.query((x_ref, y_ref, z_ref), 10)

# print out the real distances in km
print(list(map(distToKM, dist)))
[7.849314101994533, 7.849314102001198, 7.859304141667372, 7.859304141674064, 17.516008198378657, 17.516008198387585, 17.547633149227252, 17.547633149230194, 17.55621138531865, 17.556211385327607]
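If we also want to know which grid points these distances correspond to, the indices returned by query can be used to look up the original coordinates; a minimal sketch using the arrays defined above:
# look up the latitude/longitude of the 10 closest grid points together with their distances in km
for d, i in zip(dist, ix):
    print(lats_1d[i], lons_1d[i], distToKM(d))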

Wednesday, December 27, 2017

Use K-D Tree to query points - Part 1

We have a group of 2D points, for example, group A, which contains 50 data points. We have another point B (or another group of points), and we want to find the n points in A that are closest to B. How do we do that? The simplest approach is to calculate the distances from all the points in A to B and pick the closest n points.
But wait a minute, this sounds like a lot of work. For example, if we want to find the single closest country to the US, do we really need to compute the distance to every country in the southern hemisphere? Is there a good way to avoid calculating all the distances? Yes, there is! This is what I want to show this week - the K-D tree. I won’t go into the details of this algorithm; if you are interested, you can find the intuition here.
I will use a python example to show how easy it is to use a K-D tree to do the job. You can find the code on Qingkai's Github. But you will notice that the distance here is just the Euclidean distance; what if we want to use distances on the earth? We will show that in part 2 in the next post.
from scipy import spatial
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

plt.style.use('seaborn-poster')
Let’s first generate 50 x and y points in group A and, for simplicity, one point in group B. We will find the 3 data points in A that are closest to B.
np.random.seed(100)
A = np.random.random((50,2))*100

B = [50, 50]
We first build the K-D tree using the function in scipy, and query the 3 closest points to B. Note that the distance here is the Euclidean distance.
# we use cKDTree instead of KDTree, since it is a C version, will be faster. 
kdtree = spatial.cKDTree(A)

# it returns the distance and the index of the points, 3 means we want the top 3 closest points
dist, ix = kdtree.query(B, 3)
plt.plot(A[:, 0], A[:, 1], 'o', label = 'Point in A')
plt.plot(B[0], B[1], 'r*', markersize = 30, label = 'Point B')
plt.plot(A[:, 0][ix], A[:, 1][ix], 'ro', label = '3 closest points')
plt.legend(numpoints = 1)
plt.show()
[Figure: points in A (blue), point B (red star), and the 3 closest points (red)]
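Note that query also accepts an array of query points, so if B were itself a group of points we could search for the neighbors of all of them in one call. A small sketch below - the second query point is made up just for illustration:
# query several points at once; each row of dist_g/ix_g corresponds to one query point
B_group = np.array([[50, 50], [10, 90]]) # the second point is hypothetical
dist_g, ix_g = kdtree.query(B_group, 3)
print(dist_g.shape, ix_g.shape) # (2, 3) and (2, 3)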

Find all points within certain distance

We can also ask the question: ‘what are the points within distance 20 from B’. Here is how we do it:
# let's find all the points within distance 20 from point B
ix_list = kdtree.query_ball_point(B, 20)
plt.plot(A[:, 0], A[:, 1], 'o', label = 'Point in A')
plt.plot(B[0], B[1], 'r*', markersize = 30, label = 'Point B')
plt.plot(A[:, 0][ix_list], A[:, 1][ix_list], 'ro', label = 'Points within 20')
plt.legend(numpoints = 1)
plt.show()
[Figure: points in A (blue), point B (red star), and the points within distance 20 (red)]

Find all pairs within certain distance

If you have another group of points, you can find all the points in one group that are within a certain distance of the points in the other group. Here I will just query group A against itself; basically, we will find all the pairs of data points that are within distance 10 of each other. You can see that each point is included in its own returned index list, because its distance to itself is 0.
kdtree.query_ball_tree(kdtree, 10)
[[0, 45],
 [1, 10],
 [2],
 [3],
 [4, 22],
 [5, 8, 9],
 [6],
 [7, 36],
 [5, 8, 23],
 [5, 9, 11, 23],
 [1, 10],
 [9, 11, 23, 38],
 [12, 16, 46],
 [13],
 [14],
 [15],
 [12, 16, 39, 46],
 [17],
 [18],
 [19, 37],
 [20],
 [21, 40],
 [4, 22],
 [8, 9, 11, 23, 38],
 [24],
 [25, 32],
 [26, 30, 44],
 [27, 42],
 [28, 35],
 [29, 39],
 [26, 30, 44],
 [31],
 [25, 32],
 [33],
 [34, 35],
 [28, 34, 35],
 [7, 36],
 [19, 37],
 [11, 23, 38],
 [16, 29, 39],
 [21, 40],
 [41],
 [27, 42],
 [43],
 [26, 30, 44, 48],
 [0, 45],
 [12, 16, 46],
 [47],
 [44, 48, 49],
 [48, 49]]
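If you only want each pair reported once, without the trivial self matches, the tree also provides a query_pairs method; a minimal sketch (the exact return type may differ slightly between scipy versions):
# query_pairs returns each pair (i, j) with i < j exactly once, and no self pairs
pairs = kdtree.query_pairs(10)
print(sorted(pairs))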

Sunday, December 24, 2017

Create discrete colormap

Many times, we need different color scales to emphasize different parts. Sometimes we want to specify the colormap ourselves, and there are ways to do this. In particular, we sometimes want to specify color bins so that if a value falls into a bin, it gets one specific color. For example, I want the following color scheme:
  • 0 <= v < 1, white
  • 1 <= v < 4, yellow
  • 4 <= v < 8, brown
  • 8 <= v < 20, orange
  • 20 <= v < 50, magenta
  • 50 <= v < 80, #FF0033 (sort of red)
  • 80 <= v < 100, #721980 (purple)
In the following, I will give an example; you can find the notebook on Qingkai's Github.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
%matplotlib inline
plt.style.use('seaborn-poster')
# generate x and y value between [0, 100)
np.random.seed(10)
x = np.random.randint(low = 0, high=100, size=200)
y = np.random.randint(low = 0, high=100, size=200)
# set colors
colorsList = ['white', 'yellow', 'brown', 'orange', 'magenta', '#FF0033', '#721980']
CustomCmap = colors.ListedColormap(colorsList)
bounds = np.array([0, 1, 4, 8, 20, 50, 80, 100])
norm = colors.BoundaryNorm(boundaries=bounds, ncolors=7)
plt.scatter(x,y,c=y, s = 80, cmap=CustomCmap, norm = norm)
plt.colorbar()
plt.ylim(-1,101)
plt.xlim(-1,101)
plt.show()
[Figure: scatter plot of the random points colored with the discrete colormap]
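To check which bin, and therefore which color, a particular value falls into, we can pass it through the same norm and colormap; a minimal sketch using the objects defined above (the sample values are arbitrary):
# norm maps a value to its bin index, and the colormap turns the index into an RGBA color
for v in [0.5, 3, 10, 75]:
    i = int(norm(v))
    print(v, colorsList[i], CustomCmap(i))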

Thursday, December 21, 2017

What is ablation study in machine learning

We often come across the term 'ablation study' in machine learning papers; for example, the original R-CNN paper has a section of ablation studies. But what does this mean?

Well, when we build a model, it usually has several components. If we remove one of these components, what is the effect on the model? This is a very rough definition of an ablation study - we want to see the contribution of a proposed component by comparing the model that includes this component with the model that does not.

In the above paper, in order to see the effect of fine-tuning the CNN, the authors compared the performance of the model with fine-tuning against the performance without it. This way, we can easily see the effect of the fine-tuning.

The following is copied from Jonathan Uesato's answer on Quora, which explains it very well:

An ablation study typically refers to removing some “feature” of the model or algorithm and seeing how that affects performance.
Examples:
  • An LSTM has 4 gates: feature, input, output, forget. We might ask: are all 4 necessary? What if I remove one? Indeed, lots of experimentation has gone into LSTM variants, the GRU being a notable example (which is simpler).
  • If certain tricks are used to get an algorithm to work, it’s useful to know whether the algorithm is robust to removing these tricks. For example, DeepMind’s original DQN paper reports using (1) only periodically updating the reference network and (2) using a replay buffer rather than updating online. It’s very useful for the research community to know that both these tricks are necessary, in order to build on top of these results.
  • If an algorithm is a modification of a previous work, and has multiple differences, researchers want to know what the key difference is.
  • Simpler is better (inductive prior towards simpler model classes). If you can get the same performance with two models, prefer the simpler one.

Sunday, December 17, 2017

AGU 2017 at New Orleans

I was at AGU 2017 this week; it was held in New Orleans this time (finally not in San Francisco anymore). It was a great conference overall, and here are some photos I took:

Thursday, November 30, 2017

Fun: Father and Ice Cream

Recently, my daughter (3.5 years old) drew a picture, which she calls 'Father and Ice Cream' (see the image below). I know that in her eyes, both Father and Ice Cream are important. I feel the drawing of me looks like a carrot (according to one of my friends), but at least I can tell which one is me and which one is the ice cream. Maybe she inherited her mom's drawing skills.


Saturday, November 18, 2017

Python: Grab useful data from tables on webpage

Suppose you want to grab data from a website where the data is stored by day on different pages. Python is a great tool for doing this across many pages, instead of going through them one by one and copying the values by hand. This week a friend asked me about exactly this problem, and here is how we did it. You can find all the code on Qingkai's Github.

The problem

We want to download the Sea Level Pressure value from this website and save it to a csv file, for example, the red part in the following figure:
[Figure: the daily weather history page with the Sea Level Pressure value highlighted in red]

Grab the value from the website

# note: this post uses Python 2 (urllib2) and BeautifulSoup 3
import urllib2
import datetime
import numpy as np
from BeautifulSoup import BeautifulSoup
First we will define a function that takes in the year, month and day, and generates the url for that day. Look at the website url, for example 'https://www.wunderground.com/history/airport/KICT/2017/1/1/DailyHistory.html?req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=', which is the page for 2017-01-01. It is easiest to think of the url as a path to a folder: if we want the data for 2017-01-02, we only need to change it to 'https://www.wunderground.com/history/airport/KICT/2017/1/2/DailyHistory.html?req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo='. Now you can see why we pass the year, month and day to the function that generates the url.
def generate_url(year, month, day):
    base_url = 'https://www.wunderground.com/history/airport/KICT/%d/%d/%d/DailyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo='%(year, month, day)
    return base_url
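For example, calling it for January 2nd, 2017 prints a url following the pattern described above:
# quick check: the generated url for 2017-01-02
print(generate_url(2017, 1, 2))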
Next, we generate a list of dates from the starting date to the ending date
start = datetime.datetime.strptime("2017-01-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2017-01-10", "%Y-%m-%d")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]
print(date_generated)
[datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 2, 0, 0), datetime.datetime(2017, 1, 3, 0, 0), datetime.datetime(2017, 1, 4, 0, 0), datetime.datetime(2017, 1, 5, 0, 0), datetime.datetime(2017, 1, 6, 0, 0), datetime.datetime(2017, 1, 7, 0, 0), datetime.datetime(2017, 1, 8, 0, 0), datetime.datetime(2017, 1, 9, 0, 0)]
Now comes the fun part: we will use the urllib2 module to open the url and the BeautifulSoup module to parse the html page. Let’s first do this for one page: 2017-01-01.
url = 'https://www.wunderground.com/history/airport/KICT/2017/1/1/DailyHistory.html?req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo='

response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
The way we extract the information is by checking the page source code - html in this case - and finding the information we want. Go to the webpage, right click on the page, and choose ‘View Page Source’. You will see the raw html for the webpage. Browsing through the source, you will find that the data we want is stored in a table, see the following figure.
[Figure: the page source showing the html table that contains the data]
The red rectangle highlights the part we want. An HTML table is defined with the <table> tag, each table row with the <tr> tag, and a table header with the <th> tag (by default, table headings are bold and centered). A table data/cell is defined with the <td> tag. You can find more details here.
Now we have an idea of how the data is structured and how we can extract it: we first find the table we want (there are multiple tables on the page), then find all the rows in the table and loop through them to see which one contains 'Sea Level Pressure'; if a row does, we extract the <span> element whose class is "wx-value". By browsing through the html, we find that the data we want is stored in the table with class 'responsive airport-history-summary-table' or id 'historyTable', which we can use to search for it.
table = soup.find('table', {'id': 'historyTable'})
You can print out the table content to see what is inside (you will see the raw html as in the image shown above). We then loop through each row of this table and find the row which contains ‘Sea Level Pressure’:
# we loop through each row
for row in table.findAll('tr'):
    # find all the cells in the row
    data = row.findAll('td')
    
    # check if this row contains 'Sea Level Pressure'
    if data and 'Sea Level Pressure' in data[0].text:

        if data[1].find('span'):
            pressure = data[1].find('span', attrs={'class' : 'wx-value'}).text
            
print(pressure)
29.85
The above code starts with table.findAll(‘tr’), which basically says we want to find all the rows in this table and return them in a list. Then we loop through each row and find all the cells in the row, that is row.findAll(‘td’). The if statement checks whether the first cell contains “Sea Level Pressure” as the title; if it does, we look in the next cell for a span with class ‘wx-value’, which holds the value we want. You will also notice another if statement checking data[1].find(‘span’); we want to see whether the cell contains a span at all, because some rows contain “Sea Level Pressure” but no span (that is the title row of the table).
Anyway, we successfully extracted the pressure value from the table. Let’s put everything together by defining a function that returns the pressure for a given url, and then loop through all the dates.
def get_pressure(url):
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response)
    
    table = soup.find('table', {'id': 'historyTable'})
    
    for row in table.findAll('tr'):
        data = row.findAll('td')
        if data and 'Sea Level Pressure' in data[0].text:
            
            if data[1].find('span'):
                pressure = data[1].find('span', attrs={'class' : 'wx-value'}).text
                
    return pressure
results = []

for date in date_generated:
    year = date.year
    month = date.month
    day = date.day
    
    # generate the url for each day
    url = generate_url(year, month, day) 
    date_str = date.strftime('%Y-%m-%d')
    pressure = get_pressure(url)
    results.append([date_str, float(pressure)]) 

print(results)
# you can also save it as csv
#np.savetxt('pressure.csv', np.array(results), fmt = '%s', delimiter=',', header='date, pressure(in)')
[['2017-01-01', 29.85], ['2017-01-02', 29.78], ['2017-01-03', 30.18], ['2017-01-04', 30.24], ['2017-01-05', 30.24], ['2017-01-06', 30.43], ['2017-01-07', 30.65], ['2017-01-08', 30.43], ['2017-01-09', 29.97]]
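One practical note: downloading many pages in a loop can fail partway through (a network hiccup, or a day whose page has a slightly different layout). Below is a small sketch of a more defensive wrapper around get_pressure; the retry count, the pause, and the None fallback are arbitrary choices for illustration:
import time

def get_pressure_safe(url, n_retries=3):
    # try a few times before giving up; return None if we never get a value
    for _ in range(n_retries):
        try:
            return get_pressure(url)
        except Exception:
            time.sleep(2) # wait a bit before retrying
    return None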

Get all the table values

Ok, by now I think you already know how to extract values from a cell in a table. Let’s do another test: if you scroll down to the bottom of the page, you will see a table with roughly hourly measurements (see the following figure). Now we want to get the sea level pressure measurement for each hour of each day. One example is:
[Figure: the hourly observations table at the bottom of the page]
I don’t think I need to explain more; see the code below, you can figure it out :-)
# To save space, let's only get two days data
start = datetime.datetime.strptime("2017-01-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2017-01-03", "%Y-%m-%d")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

results = []
for date in date_generated:
    year = date.year
    month = date.month
    day = date.day
    
    # generate the url for each day
    url = generate_url(year, month, day) 
    # keep a string for that date
    date_str = date.strftime('%Y-%m-%d')
    
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response)
    
    # by looking at the html, we found the table name is 'obs-table responsive'
    table = soup.find('table', {'class': 'obs-table responsive'})
    
    # loop through the table row without the first row (that is the header)
    for row in table.findAll('tr')[1:]:
        # find all the cells
        cells = row.findAll('td')
        
        # the time is in the first cell (index 0), and the pressure value is in the cell with index 5
        results.append([date_str, cells[0].text, cells[5].find('span', attrs={'class' : 'wx-value'}).text])

print("\n".join(' '.join(map(str,sl)) for sl in results))      
#np.savetxt('pressure.csv', np.array(results), fmt = '%s', delimiter=',', header='date, time (CST), pressure (in)')
2017-01-01 12:53 AM 29.99
2017-01-01 1:53 AM 29.97
2017-01-01 2:53 AM 29.94
2017-01-01 3:53 AM 29.95
2017-01-01 4:53 AM 29.93
2017-01-01 5:53 AM 29.95
2017-01-01 6:53 AM 29.96
2017-01-01 7:53 AM 29.96
2017-01-01 8:53 AM 29.96
2017-01-01 9:53 AM 29.94
2017-01-01 10:53 AM 29.92
2017-01-01 11:53 AM 29.89
2017-01-01 12:53 PM 29.84
2017-01-01 1:31 PM 29.80
2017-01-01 1:53 PM 29.81
2017-01-01 2:41 PM 29.77
2017-01-01 2:53 PM 29.77
2017-01-01 3:53 PM 29.77
2017-01-01 4:53 PM 29.76
2017-01-01 5:53 PM 29.77
2017-01-01 6:53 PM 29.77
2017-01-01 7:53 PM 29.78
2017-01-01 8:53 PM 29.79
2017-01-01 9:13 PM 29.77
2017-01-01 9:53 PM 29.78
2017-01-01 10:00 PM 29.77
2017-01-01 10:46 PM 29.77
2017-01-01 10:53 PM 29.78
2017-01-01 11:37 PM 29.77
2017-01-01 11:53 PM 29.78
2017-01-02 12:20 AM 29.76
2017-01-02 12:53 AM 29.76
2017-01-02 1:53 AM 29.75
2017-01-02 2:53 AM 29.76
2017-01-02 3:38 AM 29.75
2017-01-02 3:53 AM 29.75
2017-01-02 4:53 AM 29.73
2017-01-02 5:32 AM 29.72
2017-01-02 5:53 AM 29.71
2017-01-02 6:18 AM 29.72
2017-01-02 6:53 AM 29.73
2017-01-02 7:53 AM 29.76
2017-01-02 8:42 AM 29.75
2017-01-02 8:53 AM 29.76
2017-01-02 9:06 AM 29.74
2017-01-02 9:43 AM 29.75
2017-01-02 9:53 AM 29.77
2017-01-02 10:51 AM 29.76
2017-01-02 10:53 AM 29.76
2017-01-02 11:16 AM 29.77
2017-01-02 11:36 AM 29.75
2017-01-02 11:53 AM 29.75
2017-01-02 12:53 PM 29.73
2017-01-02 1:53 PM 29.71
2017-01-02 2:53 PM 29.71
2017-01-02 3:26 PM 29.71
2017-01-02 3:53 PM 29.73
2017-01-02 4:53 PM 29.75
2017-01-02 5:53 PM 29.77
2017-01-02 6:51 PM 29.80
2017-01-02 6:53 PM 29.80
2017-01-02 7:08 PM 29.81
2017-01-02 7:20 PM 29.82
2017-01-02 7:29 PM 29.83
2017-01-02 7:53 PM 29.83
2017-01-02 8:11 PM 29.85
2017-01-02 8:25 PM 29.86
2017-01-02 8:53 PM 29.88
2017-01-02 9:48 PM 29.89
2017-01-02 9:53 PM 29.89
2017-01-02 10:43 PM 29.90
2017-01-02 10:53 PM 29.91
2017-01-02 11:18 PM 29.90
2017-01-02 11:53 PM 29.91