Wednesday, March 30, 2016

Mining Gmail using Python (only timestamp)

Today I gave a presentation at The Hacker Within Berkeley about some simple mining of the arrival and sending times of your Gmail emails using Python. You can download the notebook at Qingkai's Github. The idea is to find patterns in the emails I received and sent. I only use the timestamp associated with each email, but you can use other information as well. The data I used is my entire inbox and sent box, which happens to cover my PhD life, since I registered my Gmail account when I started my PhD here.

1 Distribution of the emails throughout the day
The following two figures show the distribution of my incoming/outgoing emails during the day. I have 3 peaks: one in the late morning, one in the afternoon, and one before sleep.
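
A minimal sketch of how such an hour-of-day distribution could be counted, using only the Python standard library. The Date header strings below are made up for illustration, and the hour is taken in whatever timezone each header records; the notebook may do this differently.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

# Hypothetical sample of raw "Date" header values; real data would come
# from your own mailbox export.
date_headers = [
    "Wed, 30 Mar 2016 10:15:00 -0700",
    "Wed, 30 Mar 2016 10:45:12 -0700",
    "Wed, 30 Mar 2016 15:02:30 -0700",
    "Thu, 31 Mar 2016 23:30:00 -0700",
]

def hourly_counts(headers):
    """Count emails per hour of day (0-23) from Date header strings."""
    counts = Counter(parsedate_to_datetime(h).hour for h in headers)
    return [counts.get(hour, 0) for hour in range(24)]

counts = hourly_counts(date_headers)

# Crude text histogram: one '#' per email in that hour
for hour, n in enumerate(counts):
    if n:
        print(f"{hour:02d}:00  {'#' * n}")
```

Run separately on the inbox and the sent box to get the two distributions.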
2 Distribution of the emails throughout the period
The following two figures show when the emails came in from 2012 to 2016. Some immediate features are:
(1) there are gaps during the summer/winter breaks
(2) some emails come in at fixed times from automatic services
(3) I do seem to get more emails as time progresses


3 Aggregate the count
The following two figures show the number of emails aggregated by day and by month. You can see the high-frequency changes on the daily plot and the low-frequency changes on the monthly plot. The outgoing emails seem to correlate very well with the incoming emails. And indeed, there's a trend that I receive more and more emails.
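
The aggregation itself can be sketched in a few lines. This is only an illustration with made-up timestamps and a hypothetical helper `count_by`, not the code from the notebook. Note that days or months with no emails simply do not appear in the result, so you would need to fill them with zeros before plotting a continuous series.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps standing in for the real mailbox data.
timestamps = [
    datetime(2016, 3, 29, 9, 5),
    datetime(2016, 3, 29, 14, 40),
    datetime(2016, 3, 30, 11, 0),
    datetime(2016, 4, 1, 22, 15),
]

def count_by(timestamps, fmt):
    """Count timestamps grouped by a strftime key such as '%Y-%m-%d'."""
    counts = Counter(t.strftime(fmt) for t in timestamps)
    return dict(sorted(counts.items()))

daily = count_by(timestamps, "%Y-%m-%d")   # one bucket per calendar day
monthly = count_by(timestamps, "%Y-%m")    # one bucket per calendar month
print(daily)
print(monthly)
```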

4 Regression
The following figure shows the relationship between the incoming and outgoing emails. It seems that for every 100 emails I receive, I send about 28.
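
A fit like this can be sketched with an ordinary least-squares line. The monthly counts below are invented for illustration (they are not my data), and I am assuming a straight-line fit with `numpy.polyfit`; the slope is the number of emails sent per email received.

```python
import numpy as np

# Hypothetical monthly counts, not the real data from the post.
received = np.array([120, 200, 260, 310, 400, 450], dtype=float)
sent = np.array([35, 55, 75, 85, 112, 127], dtype=float)

# Fit sent = slope * received + intercept by ordinary least squares.
slope, intercept = np.polyfit(received, sent, deg=1)
print(f"about {100 * slope:.0f} emails sent per 100 received")
```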

5 Finding the repeating pattern
The following figure shows the FFT spectrum of the emails I sent out. I also labeled the top 5 peaks, which correspond to 1 week, half a year, a full year, 8 months, and half a week. Can you tell why they repeat?
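
To show how peaks like these can be picked out, here is a small sketch. It assumes the input is a series of daily counts (one value per day) and uses a synthetic series with a built-in weekly rhythm instead of my real data. The helper name `top_periods` is just for this example.

```python
import numpy as np

def top_periods(daily_counts, k=5):
    """Return the k strongest (period_in_days, power) pairs of a daily series."""
    x = np.asarray(daily_counts, dtype=float)
    x = x - x.mean()                         # remove the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2      # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0)   # frequencies in cycles per day
    # Rank all bins except index 0 (zero frequency, infinite period)
    order = np.argsort(power[1:])[::-1][:k] + 1
    return [(1.0 / freqs[i], power[i]) for i in order]

# Synthetic example: a weekly rhythm plus noise over 52 weeks
rng = np.random.default_rng(0)
days = np.arange(364)
counts = 10 + 5 * np.cos(2 * np.pi * days / 7) + rng.normal(0, 1, days.size)

periods = top_periods(counts, k=1)
print(f"strongest period: {periods[0][0]:.1f} days")
```

Converting each peak frequency to a period (1 / frequency) is what lets you read the peaks as "1 week" or "half a year".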
6 Did I change my email-sending behavior?
Looking at the first figure, do you think I changed my behavior of sending emails? We can model this data and try to answer the question using Bayesian analysis (see the notebooks for details). The basic idea is to find a month that marks a switch point, such that I sent emails differently in the months before it than in the months after it. Using a Poisson distribution to model the count data, and a uniform distribution to reflect that we have no prior information about the switch point, we did MCMC sampling of the posterior distribution of these parameters, which is shown in the second figure. We can see it points to the 20th month as my switch month. This happens to be the time when I finished my qualifying exam and started a new semester. Now you can see that PhD life is totally different before and after the qualifying exam, and it even shows in my email data!!!
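
For readers who want to try the switch-point idea without the notebook, here is a rough sketch using a Gibbs sampler (one kind of MCMC) written in plain NumPy. It is not necessarily the sampler used in the notebook. The Gamma prior on the two Poisson rates and its parameters `a` and `b` are my assumptions for this example, and the monthly counts are synthetic, built so that the rate changes after month 20.

```python
import numpy as np

def sample_switchpoint(counts, n_iter=2000, burn=500, a=1.0, b=0.01, seed=0):
    """Gibbs sampler for a Poisson model with a single switch point.

    counts[:tau] ~ Poisson(lam1) and counts[tau:] ~ Poisson(lam2), with
    tau uniform on 1..n-1 and lam1, lam2 ~ Gamma(shape=a, rate=b) (assumed).
    Returns the post-burn-in samples of tau.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(counts, dtype=float)
    n = len(y)
    csum = np.concatenate(([0.0], np.cumsum(y)))  # csum[t] == sum(y[:t])
    taus = np.arange(1, n)                        # candidate switch points
    tau = n // 2                                  # arbitrary starting value
    samples = []
    for it in range(n_iter):
        # Conjugate Gamma updates for the two rates, given tau
        lam1 = rng.gamma(a + csum[tau], 1.0 / (b + tau))
        lam2 = rng.gamma(a + csum[n] - csum[tau], 1.0 / (b + n - tau))
        # Conditional of tau given the rates (uniform prior on tau)
        logp = (csum[taus] * np.log(lam1) - taus * lam1
                + (csum[n] - csum[taus]) * np.log(lam2) - (n - taus) * lam2)
        p = np.exp(logp - logp.max())
        tau = rng.choice(taus, p=p / p.sum())
        if it >= burn:
            samples.append(tau)
    return np.array(samples)

# Synthetic monthly counts: about 30 per month, then about 60 per month
counts = [28, 32] * 10 + [58, 62] * 10
samples = sample_switchpoint(counts)
mode = int(np.bincount(samples).argmax())
print("most probable switch point: after month", mode)
```

A histogram of `samples` plays the role of the posterior plot: a sharp peak means the data strongly favor one switch month, while a flat histogram means there is no clear change.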



