Wednesday, 23 May 2012

Working Hours Graph - Life Logger

I’ve been using Life Logger to track how long I spend at work. It only has a pretty rudimentary way to get summaries out of it. I wanted to see a graph of how long I’ve spent at work, so I exported the database and made my own.

Here’s what I came up with.

If my boss sees this, it’s all those public holidays that bring those hours down.

Here’s how I did it.

I exported the database from the app. I love how on Android I can send it to Dropbox and pick it up on my computer.

It’s a SQLite database, which I can easily manipulate using the sqlite tools on my computer.

The timestamps are stored as milliseconds from the Unix epoch rather than a normal SQL date. The developer must be using Java’s System.currentTimeMillis().
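As a quick sanity check of that timestamp format, here is a minimal Python sketch that turns a millisecond value into a date. The function name and the sample value are made up for illustration; they are not taken from the app or my database.

```python
from datetime import datetime, timezone


def millis_to_datetime(millis):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# Hypothetical sample value: midnight UTC on 23 May 2012.
sample = 1337731200000
print(millis_to_datetime(sample).strftime("%Y-%m-%d"))
```

Dividing by 1000 first is the same trick the SQL queries in this post use before handing the value to "unixepoch".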

The table I’m interested in is this one.
CREATE TABLE tbl_log (_id INTEGER PRIMARY KEY AUTOINCREMENT,time_card_id_ref INTEGER NOT NULL, start_time INTEGER DEFAULT 0, end_time INTEGER DEFAULT 0, estimation INTEGER DEFAULT 0, note TEXT );

Dates can be pulled out with this SQL query. Note that %m is the month; %M would give the minutes.
select strftime("%Y-%m-%d", start_time/1000, "unixepoch") from tbl_log;

I can get the week of each entry with this query. This gives year-week like 2012-02, which is good to group by.
select strftime("%Y-%W", start_time/1000, "unixepoch") from tbl_log;

I can get the hours worked in a week by using the sum function over a set like this.
select sum(end_time-start_time)/1000/60/60  from tbl_log where strftime("%Y-%W", start_time/1000, "unixepoch") = "2012-16";

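This can be done for all weeks like this. Sum is an aggregate function, so it runs once over each group of rows produced by the group by. The date column isn't aggregated, so SQLite reports it from one arbitrary row of each week.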
select strftime("%m/%d/%Y", start_time/1000, "unixepoch"), sum(end_time-start_time)/1000/60/60 as hours_worked from tbl_log group by strftime("%Y-%W", start_time/1000, "unixepoch");

This gives the times like this, which can be imported into a spreadsheet and charted.
11/20/2011|34 
11/27/2011|35
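
The same weekly grouping can be tried out without the real export. This is a rough sketch using Python's built-in sqlite3 module and an in-memory copy of the table; the three rows are invented sample data, not my actual hours.

```python
import sqlite3

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_log (_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "time_card_id_ref INTEGER NOT NULL, start_time INTEGER DEFAULT 0, "
    "end_time INTEGER DEFAULT 0, estimation INTEGER DEFAULT 0, note TEXT)"
)

# Invented sample data: two 8 hour days in one week, one 7 hour day the next.
monday = 1337558400000  # Monday 21 May 2012, 00:00 UTC, in milliseconds
rows = [
    (1, monday + 9 * HOUR_MS, monday + 17 * HOUR_MS),
    (1, monday + DAY_MS + 9 * HOUR_MS, monday + DAY_MS + 17 * HOUR_MS),
    (1, monday + 7 * DAY_MS + 9 * HOUR_MS, monday + 7 * DAY_MS + 16 * HOUR_MS),
]
conn.executemany(
    "INSERT INTO tbl_log (time_card_id_ref, start_time, end_time) "
    "VALUES (?, ?, ?)",
    rows,
)

# Group by year-week and sum the worked milliseconds, converted to hours.
weekly = conn.execute(
    "SELECT strftime('%Y-%W', start_time/1000, 'unixepoch') AS week, "
    "sum(end_time - start_time)/1000/60/60 AS hours_worked "
    "FROM tbl_log GROUP BY week ORDER BY week"
).fetchall()

for week, hours in weekly:
    print(week, hours)
```

Grouping on the year-week string rather than on a formatted date keeps every entry of a week in the same bucket.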

Sunday, 6 May 2012

Ubuntu updates visualization.

I was wondering if there would be lots of updates to Ubuntu leading up to the release of 11/4 and thought I could data mine this to find out. So I made a visualization to see if that was the case. It turns out the opposite is true: there has been a distinct lack of updates in the last couple of months, at least in the main repository.

Here's the visualization I made.

To do this I searched the changelogs of all packages from changelogs.ubuntu.com for their timestamps. Then I averaged the counts within each month to smooth the result.

Here's how I did it.


# First download the changelogs.ubuntu.com website, only using the main pool.
wget -m -np -r http://changelogs.ubuntu.com/changelogs/pool/main/
# Search the mirrored directory for all timestamps and print them out (-h leaves the file names off).
egrep -o -h -r -e "[A-Z][a-z]{2}, [0-9]{1,2} [A-Z][a-z]{2} [0-9]{4}" changelogs.ubuntu.com/ > dates.main

I then wrote a Python script to count how often each date occurs and average those counts over each month.
#!/usr/bin/env python
import sys
import time

# Count how many times each timestamp line (e.g. "Mon, 23 Apr 2012") occurs.
dates_in = {}
for line in sys.stdin:
    line = line.strip()
    if line:
        dates_in[line] = dates_in.get(line, 0) + 1

# Collect the per-date counts under the first day of their month.
month_counts = {}
for date_s, count in dates_in.items():
    date = time.strptime(date_s, "%a, %d %b %Y")
    month_date = "%d/01/%d" % (date.tm_mon, date.tm_year)
    month_counts.setdefault(month_date, []).append(count)

# Print "month,average" as CSV. // keeps a whole-number average on Python 2 and 3.
for month_date, counts in month_counts.items():
    print(month_date + "," + str(sum(counts) // len(counts)))


Run that on the downloaded changelogs.
python data.py < dump/dates.main > weekly.avg.pr.month.csv
Then I imported that CSV into Google Docs to make the graph.

Saturday, 14 April 2012

Problem with corrupt files and eCrypt.



I've been noticing some strange behaviour on my Ubuntu laptop lately. In Google Chrome I get logged out whenever I close the browser, even though I set it to remember me. I've also seen a lot of these log messages from eCryptfs, which encrypts my home directory.


Apr 14 10:29:41 stephen-ThinkPad-T520 kernel: [170037.012495] Valid eCryptfs headers not found in file header region or xattr region, inode 524280
Apr 14 10:29:41 stephen-ThinkPad-T520 kernel: [170037.012497] Either the lower file is not in a valid eCryptfs format, or the key could not be retrieved. Plaintext passthrough mode is not enabled; returning -EIO


I was concerned by these file system errors, so I did some digging and found the problem files by using "find -inum 524280". Most of the files were to do with Chrome, including Chrome's local cookie store. This must have been why I was getting logged out. Seeing as these are temporary files I decided to delete them. Chrome will recreate them, and hopefully the new ones won't get corrupted or have any trouble with eCryptfs again.

Here is the script I used to search for all of the corrupt files.

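# Pull the inode numbers out of the log, skipping the one already dealt with.
grep "Valid eCryptfs headers not found" /var/log/syslog | grep -v 524280 | sed 's/.*inode //' | sort -u > filestocheck

# Find the file that belongs to each inode.
for i in $( cat filestocheck ); do find /home/stephen -inum $i -print; done | sort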
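
If you would rather not depend on column positions in the log, the inode numbers can also be picked out with a regular expression. This is only a sketch of that idea in Python; the function name is my own, and the sample lines are the two log messages shown above.

```python
import re

# Matches the eCryptfs warning and captures the inode number at the end.
INODE_RE = re.compile(r"Valid eCryptfs headers not found.*inode (\d+)")


def corrupt_inodes(lines):
    """Return the sorted, de-duplicated inode numbers found in syslog lines."""
    found = set()
    for line in lines:
        match = INODE_RE.search(line)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


sample = [
    "Apr 14 10:29:41 stephen-ThinkPad-T520 kernel: [170037.012495] "
    "Valid eCryptfs headers not found in file header region or xattr "
    "region, inode 524280",
    "Apr 14 10:29:41 stephen-ThinkPad-T520 kernel: [170037.012497] "
    "Either the lower file is not in a valid eCryptfs format, or the key "
    "could not be retrieved. Plaintext passthrough mode is not enabled; "
    "returning -EIO",
]
print(corrupt_inodes(sample))
```

Each number it returns can then be passed to find -inum as in the loop above.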


Thursday, 8 March 2012

How to find an open source project to work on using github search.

You can search for popular projects that have TODO notes in the source code.

To search for popular projects, if you judge popularity by the number of followers, execute this search query for repositories.

followers:[1 TO *]

To search for the TODO notes in that repo, execute this search query for code.
TODO repo:django/django


Because the code is indexed separately from the repositories, there is no direct relationship from the code to the repository's followers, so this has to be done in two steps. But it shouldn't take long to find something to work on. This could be automated using GitHub’s API, but it doesn’t support the advanced search criteria.


Happy hacking.

Saturday, 21 January 2012

Router bandwidth usage monitoring per user.

My flat has a shared internet connection and we have a bandwidth cap. If we go over our cap it costs us all extra money.

There's this great little script, wrtbwmon, for DD-WRT that shows total bandwidth usage. There are other projects that do a nicer job, but they require different router firmware which doesn't support my router.

The main limitation of wrtbwmon is that it doesn't show how usage changes over time. If someone suddenly uses lots of our bandwidth cap I need to be able to tell them when it was. My solution is to dump the output from wrtbwmon into a log. It'll be clumsy to work with, but hopefully I won't need to often.

Using a similar approach I could create graphs of the per-user bandwidth usage using a JavaScript graphing library, but for now this will do.


Here's my script, which I have running from a cron job on the router.


#!/bin/sh
WRTBWMON=/tmp/wrtbwmon
DB=/tmp/usage.db
PUB=/tmp/www/usage.html
SEND=/tmp/wrtbwmon_send

$WRTBWMON update $DB peak
$WRTBWMON publish $DB $PUB
date > $SEND
arp >> $SEND
cat $DB >> $SEND
echo -e "\n" >> $SEND

cat $SEND | ssh me@logging-server "cat >> wrtbwmon/log"

Wednesday, 18 January 2012

Concordion acceptance tests from Scala

Concordion is a great test automation framework written in Java. Scala is a great functional programming language which runs on the JVM. Scala can easily use Java code, which makes using Concordion from Scala very simple.

However, there was one feature of Concordion that gave us some trouble. Java Concordion fixtures can return objects with public variables that Concordion can use to make assertions. But in Scala you can't really create public class variables. Scala creates public getter and setter methods with the name of the variable and keeps the variable private. This is all done under the hood of Scala so you don't notice it. Now when the Java code of Concordion tries to access those variables it instead finds the getter / setter methods and fails to make the assertion, throwing a very confusing error because it found a method with the name of the variable you were expecting.

Here's an example of the Concordion spec from the Concordion tutorial page, which calls the "split()" method returning the "result" object, which has the public variables "firstName" and "lastName". This works fine from Java.

<span concordion:execute="#result = split(#TEXT)">John Smith</span>
                will be broken into first name
                <span concordion:assertEquals="#result.firstName">John</span>
                and last name
                <span concordion:assertEquals="#result.lastName">Smith</span>.
As a workaround to this problem the concordion-scala project was created by Tom Malone. It is supported by the Concordion creators, who link to it from the Concordion website.


However my colleague, Jun Yamog, a true Scala guru, thought of a much simpler solution.


In the Concordion spec you can specify that the object value you're accessing is in fact a method, and that it should be treated as such, by simply adding parentheses to the end of the call, like so.

<span concordion:execute="#result = split(#TEXT)">John Smith</span>
                will be broken into first name
                <span concordion:assertEquals="#result.firstName()">John</span>
                and last name
                <span concordion:assertEquals="#result.lastName()">Smith</span>.
Which will work with the following Scala case class.

case class result(firstName: String, lastName: String)


This is a ridiculously subtle workaround, using the Scala generated getter/setter methods from Concordion in a way that was not documented to be possible. Thus the use of the concordion-scala port can be avoided.

For some complete code examples check out this GitHub repo https://github.com/stephennancekivell/concordion_scala_samples


Edit: It can also be done using Java bean properties. Thanks Tom Malone for explaining this. OGNL converts the #result.firstName to #result.getFirstName(). That wasn't clear while debugging.
So like this ..

<span concordion:execute="#result = split(#TEXT)">John Smith</span>
                will be broken into first name
                <span concordion:assertEquals="#result.firstName">John</span>
                and last name
                <span concordion:assertEquals="#result.lastName">Smith</span>.
Which will work with the following Scala case class.

case class result(@BeanProperty firstName:String, @BeanProperty lastName:String)

Monday, 19 September 2011

What a Distributed Operating System should look like and how close we are to making one


Overview - A Distributed Operating System

A distributed operating system is a system that operates a distributed collection of computers, a cluster. Some require a single controller, but that shouldn't be necessary and isn't an interesting part of the problem. The system runs programs and manages resources. The system should completely abstract the view of the hardware away from the software.


Fault tolerance, scalability and redundancy

The cluster may be heterogeneous, nodes may go offline, new nodes may come online. These are all issues that the system should handle. The system should dynamically adapt to changes in its configuration while continuing to function without error. At all times the system should utilize the cluster to its full potential. Nothing of value should be lost when a node does go down, thus redundancy needs to be built into the system.

Abstraction

For a distributed operating system, abstraction of the whole computer cluster needs to be considered. The application programmer needs the ability to create threads, manage the communication between them and access resources. The actual distribution and synchronisation should be managed by the system.

Intro to components

Compute power distribution

Distributing computational tasks between computers can be difficult or easy depending on the type of task being performed.

Some tasks fall into the embarrassingly parallel category. This is where there are lots of data items that each need the same processing. Here the data can simply be partitioned and sent to different CPUs to work on independently.
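
To make the partitioning idea concrete, here is a small Python sketch. Everything in it is illustrative: the function names are my own, squaring a number stands in for the real per-item work, and a thread pool on one machine stands in for the CPUs of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Placeholder work: the same processing applied to every item."""
    return [x * x for x in chunk]


def parallel_map(items, workers=4):
    """Process each chunk independently, then stitch the results together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, partition(items, workers))
    return [y for chunk in results for y in chunk]


print(parallel_map(list(range(10))))
```

No chunk needs anything from any other chunk, which is exactly what makes this category easy to distribute.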

Other tasks fall into the non-embarrassingly parallel category. These tasks require more complex coordination to work in parallel. Non-embarrassingly parallel tasks traditionally use shared variables and require locking of variables.

In both cases there are different threads of execution working for some process. These threads will be distributed over different CPUs and computer nodes within the cluster.
Requirements
The distributed operating system needs to abstract the hardware and run the programs. While some overhead will be accepted, the efficiency of the system is a priority. Hardware abstraction has been done very well in modern computers. There are programming languages like C that compile for many different machine architectures, and languages like Java that run in a virtual machine environment.

Resource management

Some sort of shared storage is required. For different applications a traditional file system or a database might be required. These would use the storage of all the nodes, while maintaining redundancy. The file system and database should behave like traditional implementations.

Compute power distribution

I propose that processes and threads get a context that abstracts and simplifies their distribution and fault tolerance. These contexts will be similar to the contexts of a thread or process on a single-computer operating system, with fault tolerance and scalability extensions. Contexts would define inter-thread communication and be associated with shared variables and thread locks. A global context would exist for the entire process. Smaller context groups could be defined for subgroups of threads within the process. Partitioning the threads will help the system optimize the layout of the nodes and limit the scope of shared resources, reducing the number of redundant copies being updated.
Inter thread communication, shared variables
It may be that a small group of threads need to work on a sub problem for a larger task. The parallelization would need shared variables. Normally this would require a lot of inter-thread communication, but through the use of these contexts that could be abstracted away. The threads would simply access the shared variable. The computer nodes that these threads work on should have minimal latency between them. By being in their own context group the system could know to put them close together. This is an example of an environment-aware optimization.

Sometimes threads may need to wait for other threads. These waits could be managed by these thread context groups.
Fault tolerance
The thread contexts would ensure fault tolerance by saving state to some distributed memory, then checking that the thread is still running with ‘heart beat’ messages. If the thread is found to have crashed, another thread could be started from the last context save. Here the only thing of value lost is the computation time.
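
The heart beat half of that could look something like the following Python sketch. It is a single-process toy under my own assumptions: the class and method names are invented, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per thread and reports the silent ones."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock
        self._last_seen = {}

    def beat(self, thread_id):
        """Record a heartbeat from thread_id at the current time."""
        self._last_seen[thread_id] = self._clock()

    def presumed_crashed(self):
        """Return the ids whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            thread_id
            for thread_id, seen in self._last_seen.items()
            if now - seen > self.timeout
        )


# Drive the monitor with a fake clock so the example runs instantly.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout=3, clock=lambda: fake_now[0])
monitor.beat("worker-a")
monitor.beat("worker-b")
fake_now[0] = 2.0
monitor.beat("worker-b")
fake_now[0] = 4.0
print(monitor.presumed_crashed())
```

Any thread it reports would be a candidate for restarting from its last context save.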

In traditional operating systems, threads are swapped out to allow multiple threads to run on a single CPU. This is similar to the context saves in this distributed system. The same kind of information is being saved.

The context saves could be an expensive part of the distribution of tasks. Each save comes with a cost, but each crash comes with another cost, which is lessened by the context save. Balancing the two costs could be quite difficult, especially for different cluster configurations. The distributed operating system should take care of how often context saves occur. By learning the cost of each save and how vulnerable the cluster is, an automated balancing algorithm could be implemented.
Locks
Locks should generally be avoided, especially in a redundant distributed context. A lot of locks could be avoided through the use of higher level data structures. For example, in a producer-consumer situation a thread-safe queue may be simpler to work with than a simple array. The array is much simpler to implement, but isn’t thread safe, so thread locking would be required, whereas the queue could be thread safe, requiring minimal locking from the clients.

Compare and swap style locking operations would also work well in a distributed system.
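
For anyone who hasn't met compare and swap, here is a rough Python sketch of the retry loop it enables. Python has no built-in compare and swap, so the AtomicCell class below is my own stand-in that fakes the atomic step with an internal lock; on real hardware or in a real cluster that step would be provided by the platform.

```python
import threading


class AtomicCell:
    """A value with a compare-and-swap operation (atomicity is simulated)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()  # stands in for the atomic primitive

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(cell):
    """Optimistic update: read, compute, and retry if someone else got in."""
    while True:
        current = cell.load()
        if cell.compare_and_swap(current, current + 1):
            return current + 1


counter = AtomicCell(0)
workers = [
    threading.Thread(target=lambda: [increment(counter) for _ in range(500)])
    for _ in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(counter.load())
```

The caller never holds a lock across its own work; a failed swap just means another thread won the race and the loop tries again.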

Traditional locks may be required at some point, or may just be desired by application programmers, so they should be implemented. To ensure redundancy with traditional locks, the locks should have a TTL (time to live). If the TTL expires the lock is invalidated, leaving the rest of the program to continue. The thread with the lock could renew the TTL to acquire more time with it. This could behave like heart beat messages to ensure fault tolerance.
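
A lock with a TTL might behave like this Python sketch. It is a single-process illustration under my own assumptions: the LeaseLock name and its methods are invented, it is not itself thread safe, and a real version would have to keep its state in the distributed shared memory.

```python
import time


class LeaseLock:
    """A lock that frees itself when its time to live runs out."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock
        self._owner = None
        self._expires = 0.0

    def _expired(self):
        return self._owner is None or self._clock() >= self._expires

    def acquire(self, owner):
        """Take the lock if it is free or its previous lease has expired."""
        if not self._expired():
            return False
        self._owner = owner
        self._expires = self._clock() + self.ttl
        return True

    def renew(self, owner):
        """Extend the lease, but only for the current, unexpired holder."""
        if self._expired() or self._owner != owner:
            return False
        self._expires = self._clock() + self.ttl
        return True

    def release(self, owner):
        """Give the lock up early; only the holder may do so."""
        if self._owner == owner and not self._expired():
            self._owner = None
            return True
        return False


# A fake clock makes the expiry visible without sleeping.
fake_now = [0.0]
lock = LeaseLock(ttl=5, clock=lambda: fake_now[0])
print(lock.acquire("thread-a"), lock.acquire("thread-b"))
```

If thread-a crashes and stops renewing, thread-b's next attempt after the TTL simply succeeds, so nobody waits forever.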

Heterogeneous competition
With the context saves continually happening to the distributed system, two copies of the same thread could be started from the same context save on different nodes. One may finish faster; the results from that node could be taken and the slower node could abandon its progress, as it would lead to the same result. This would be useful when the cluster is waiting for one thread to finish, which may be on a slower computer. This optimization was shown in the Google MapReduce paper [GMR] to greatly improve the overall performance of the cluster.
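
Here is a toy Python version of that race. The function name is mine, threads stand in for nodes, and a sleep stands in for a node being slow; it only shows the "first result wins" part, not the restart from a context save.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_speculatively(task, node_delays):
    """Run one copy of task per simulated node and keep the first result.

    Returns (index of the winning node, result).
    """
    def on_node(index, delay):
        time.sleep(delay)  # simulates a slower or faster machine
        return index, task()

    pool = ThreadPoolExecutor(max_workers=len(node_delays))
    futures = [
        pool.submit(on_node, index, delay)
        for index, delay in enumerate(node_delays)
    ]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        future.cancel()  # best effort; a copy already running just finishes
    pool.shutdown(wait=False)
    return next(iter(done)).result()


# Node 0 is slow, node 1 is fast, so node 1's copy should win.
print(run_speculatively(lambda: 6 * 7, [0.3, 0.01]))
```

Because both copies compute the same thing, it doesn't matter which one wins, only that the cluster stops waiting on the slow machine.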

Embarrassingly parallel
The distribution of embarrassingly parallel tasks has been done well by Google's MapReduce [GMR] and the Apache Hadoop project [AHD]. Similar map reduce functionality could easily be implemented within this cluster using these contexts. One thread could read the input and start worker threads, which would send their results to reducer threads.
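
As a reminder of the shape of a map reduce job, here is the classic word count written as a plain, single-process Python sketch. The function names are my own and nothing here is distributed; in the cluster the map calls would run in worker threads and the reduce calls in reducer threads.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group the emitted values by key, as done between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values emitted for one key."""
    return sum(values)


def map_reduce(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return {key: reduce_phase(key, values)
            for key, values in shuffle(pairs).items()}


print(map_reduce(["the cat sat", "the dog sat down"]))
```

Each map call only looks at its own document and each reduce call only at its own key, which is why both phases spread across nodes so easily.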

Resource management

File System
The distribution of the file system can be achieved by placing only part of the file system, called a brick, on each particular node. For redundancy, each brick would be replicated on different computer nodes.

The client using the file system would query which node a particular block was on then get that block from the respective node. This would all be packaged in the file system driver and appear like a normal file system to the applications.

The file system needs to support growing, having more storage added to it. This can be achieved by using a dynamic file mapping structure, which would need to be distributed and redundant across the cluster.

Higher throughput could be achieved by striping the data across multiple servers. This is done by partitioning the data into blocks, where successive blocks go to different cluster nodes. The speed-up comes from overlap: while one block is physically being written to or read from one disk, the next block can start being written to or read from the next node. Other environment-aware optimizations could also be implemented, such as file locality.
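
The block placement could be as simple as round robin. This Python sketch is one possible scheme I've assumed for illustration, not how GFS or Gluster actually place data: block i goes to node i modulo the number of nodes, and its redundant copies go to the nodes that follow.

```python
def split_into_blocks(data, block_size):
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_block(block_index, nodes, replicas=2):
    """Return the nodes holding a block: the primary first, then the copies."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested redundancy")
    return [nodes[(block_index + r) % len(nodes)] for r in range(replicas)]


nodes = ["node0", "node1", "node2"]
blocks = split_into_blocks(b"abcdefgh", 3)
for index, block in enumerate(blocks):
    print(index, block, place_block(index, nodes))
```

Successive blocks land on different primaries, so a sequential read can have the next node fetching while the current one is still busy.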

Shared memory
The system should store its context saves in memory. These will be broadcast to other nodes for redundancy. Not all nodes need to store the contexts for all other threads, just enough for redundancy. This could require lots of memory, but the threads can choose when they have reached a place to save state, ensuring that they only save what's required. To ensure context saves don't come out of order, they will have a timestamp. So if two computers compute the same context state, but one is further through its computation than the other, each thread will know how far through it is. The node receiving both will know which one to keep: the one with the higher timestamp.
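
The "keep the higher timestamp" rule is small enough to sketch in Python. The ContextStore class is my own invention for illustration, and I've treated the timestamp as a simple progress counter rather than wall-clock time, which is an assumption on my part.

```python
class ContextStore:
    """Holds the most advanced context save received for each thread."""

    def __init__(self):
        self._saves = {}

    def receive(self, thread_id, timestamp, state):
        """Keep the save only if it is newer than the one already held."""
        current = self._saves.get(thread_id)
        if current is None or timestamp > current[0]:
            self._saves[thread_id] = (timestamp, state)
            return True
        return False

    def latest(self, thread_id):
        """Return (timestamp, state) for the thread, or None if unknown."""
        return self._saves.get(thread_id)


store = ContextStore()
store.receive("thread-1", 7, {"rows_done": 700})
store.receive("thread-1", 5, {"rows_done": 500})  # late arrival, ignored
print(store.latest("thread-1"))
```

Because stale saves are dropped on arrival, it doesn't matter in which order the broadcasts reach a node.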

The system will need to use a similar communication and shared memory structure for coordinating the execution of thread contexts. Care will need to be taken to ensure race conditions don't occur. The scheduling and starting of threads could be done in two phases. In the first phase the thread to start is scheduled, communicated to all nodes, and any race conditions are dealt with. The second phase takes place when the schedule is agreed upon, and the thread is actually started.


How close we are to building one.

Some of the components already exist, while others need work. Work needs to be done to package everything together for easy distribution.

Distributed file systems already exist: there is the Google File System [GFS] and the Gluster file system [GLS].

Redundant coordination already exists in some computer games. With multiple computers playing one network game, one computer is chosen as the server. If that server then goes offline the other computers re-elect a new server. This is similar to the coordination required for a distributed operating system.

Apache ZooKeeper also provides redundant distributed coordination software [AZK].

We have distributed hash tables and distributed databases like Apache Cassandra [ACD], which work like a distributed shared memory. However, they're usually for persistent storage, not the fast volatile shared memory required for task coordination.

Redundant distributable task contexts have been done well for embarrassingly parallel tasks, with MapReduce and Hadoop. Building on my proposed model should work well, but it needs more work to show that it's feasible. The context saving needs to be proven to not be too expensive, then it needs to be implemented.

So how close? A year with a solid development team should be enough to put something together, using an existing distributed file system and database. The big problem will be getting it popular enough to gain momentum. A lot of the companies that could provide the initial push of development seem to work in more of an embarrassingly parallel world, so wouldn't benefit from this.

References
[GMR] Jeffrey Dean & Sanjay Ghemawat (2004) MapReduce: Simplified Data Processing on Large Clusters. Google, Inc
[AHD] http://hadoop.apache.org
[GFS] Sanjay Ghemawat, Howard Gobioff & Shun-Tak Leung (2003) The Google File System. Google, Inc
[GLS] http://www.gluster.org/
[AZK] http://zookeeper.apache.org/
[ACD] http://cassandra.apache.org/