Showing posts with label BeagleBoard.

20110803

OMAP3 SGX EGL Drivers Add Wayland Support

Just in case anyone was wondering, this is a pretty big deal. Imagination Technologies, whose 3D graphics cores drive most mobile displays, has announced support for the EGL_KHR_image_pixmap extension used by the Wayland display server protocol.

For those who haven't been following, Wayland has gained a lot of momentum as a non-X-based window compositor for Linux-based operating systems. Wayland facilitates client-side rendering, similar to the Quartz compositor used in Mac OS X, and has since been adopted by MeeGo and Ubuntu as their preferred compositing backend.

The stated goal of Wayland is to provide a user experience where "every frame is perfect". This is a necessary and long-overdue improvement, since traditional Linux desktops based on the aging X11 display server tended to suffer from artifacts such as tearing, visible redrawing, and flickering. However, Wayland retains the capability to host a traditional rootless X server for legacy applications, and Wayland rendering targets already exist for popular toolkits such as GTK+ and Qt, among others.

Check out the video below for a (slightly older) demo.



Today, Wayland support exists for graphics chipsets from Intel, AMD, and NVIDIA (via nouveau), as well as for SGX (OMAP3) platforms. OMAP4 support probably isn't far off.

I guess it's time to fire up the old BeagleBoard ;-) Incidentally, happy birthday!



20110330

The USRP E100

I thought that I would take a moment to plug a product that I think has great potential for anyone working on, experimenting with, or interested in learning about digital wireless communication: the USRP E100. This device, jointly developed by Ettus Research and OpenSDR, was announced just a few months ago. It's a tightly integrated embedded Linux solution for research into digital baseband signal processing for wireless systems.

I've worked with previous products from Ettus, like the USRP2, and have had 99% good experiences. The entire USRP product family is supported by GNU Radio, which greatly facilitates signal visualization, processing, and software interfacing. The one downside of using the USRP2 was that the only way to connect to it was over a gigabit Ethernet cable (and gigabit Ethernet is not a standard feature on laptop or desktop computers). Even then, the port did not make the USRP2 'networkable', since it was only used for data transfer via raw Ethernet packets.

The main differentiator of the E100 is that it ships with a modular ARM board from Gumstix. The stock Gumstix board is powered by an OMAP3 (Cortex-A8) chip from Texas Instruments. The modular design makes repairs and upgrades easy: any computer-on-module that conforms to the electrical and mechanical specifications can be used. The OMAP3 has appeared in several mobile phones, but (more importantly) has also been the driving force behind a tidal wave of low-cost, powerful embedded Linux developer boards such as the BeagleBoard and BeagleBoard-xM. Texas Instruments really has made a great contribution back to the developer community just by making these boards available. The OMAP3 processor is capable of running just about every operating system in existence, from Windows Mobile to Ubuntu or Android (all flavours of Linux, and FreeBSD too). The E100 (probably) ships with Ångström by default. As for interfacing, the E100 even exposes HDMI, Ethernet, and USB ports, so this SDR box can literally be its own workstation. I really wish this had been available back when I was working on the USRP2!

So - that's great - an SDR device that eliminates the need for an external laptop or desktop computer so the entire system consumes much less power in total.

There's just one more thing...

The way the OMAP3 interfaces with the radio hardware is super-efficient. The TX and RX buffers are mapped directly into the OMAP3's address space through the MMU. To the layman, this means that the Linux kernel can easily expose the radio to userspace as a regular device using Philip Balister's driver, which is on its way upstream. Furthermore, users of TI's Code Composer Studio (or developers who choose to use CGT directly) can write DSP firmware for the OMAP3's integrated C64x+ DSP. Thus, keen developers can run code on the DSP to control the baseband radio and process baseband signals directly (the way nature intended). Naturally, without proper synchronization, only one processor on the chip can 'own' the radio buffers at any one time.
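This memory-mapped buffer model is the same one ordinary Linux character drivers use: the kernel exposes a device node, and userspace maps it into its own address space. As a rough sketch (the device node name is hypothetical, and a temporary file stands in for real device memory here so the snippet is runnable anywhere):

```python
import mmap
import os
import tempfile

# Sketch only: a real driver would expose a node like /dev/usrp_e0
# (name hypothetical); a temp file stands in for the device here.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)               # pretend this is a 4 KiB RX buffer

buf = mmap.mmap(fd, 4096)            # map the "buffer" into our address space
buf[0:4] = b"\x01\x02\x03\x04"       # read/write it like ordinary memory
first_word = buf[0:4]

buf.close()
os.close(fd)
os.unlink(path)
```

Against a real device node you would open with `os.open(path, os.O_RDWR)` and skip the truncate; the mapping itself works the same way.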

To summarize: the USRP E100 is the ideal product for most engineers researching embedded RF systems and digital baseband processing.

PS: Nice work Phil! (he was my co-mentor for GSOC2010). I would love to use the E100 for some of my more recent work with ahumanright.org to engineer a low-cost / low-power satellite modem...

20101106

MultiCore Threading

This post is partially in response to a question somebody asked me recently about threading on ARM Cortex-A9 systems. I was asked whether, just by creating several new "threads", those threads would "automatically" run at the same time on separate cores without any operating system or system library interaction. The short answer is no.

The long answer begins with a one-minute history of computer architecture. A processor generally has something called an instruction pipeline. In the earliest designs, only one instruction (read: hardware function) could ever be executing at any given time. Some clever hardware engineers determined that this was not utilizing the hardware as effectively as possible, so they came up with pipelining, which overlaps the stages of several instructions, and later the superscalar architecture, which issues more than one instruction per cycle. Generally speaking, this meant that if the 'add' unit was busy at one point in time, the 'memory fetch' unit could be in use at the same point in time.

This introduced something the industry termed a 'data hazard'. For example, if a certain add operation depended on the result of a memory fetch operation, then the add would produce unanticipated results if the memory fetch had not completed in time. The first solution to this problem was to introduce stalls in the pipeline, which were (and still are) very bad. The second solution (really an improvement on the first) was to add another hardware unit to the chip that re-orders instructions before sending them down the pipeline, in order to minimize stalls due to data hazards. That hardware unit is called an out-of-order execution unit. Instruction scheduling can also be done in software, by the compiler and linker, but since that only allows off-line scheduling, it cannot account for asynchronous events that are only stochastically predictable. This is where the branch prediction unit comes into play, but I'll omit that for brevity. So far, only instruction-level parallelism has been covered.

The Cortex-A9 Pipeline
The Cortex-A9 MPCore
Now, most manufacturers realized that it would be best to let uniprocessor code execute on multicore systems, so that programmers and compiler designers wouldn't have nervous breakdowns trying to optimize their code for the googols of system permutations that would otherwise exist (all of them vector processors). Thanks to all the manufacturers for that one. ARM is no different: the Cortex-A9 family of processors still implements the ARMv7-A instruction set architecture. However, the decision to run uniprocessor code on multicore systems necessitated software entities to manage, and really to schedule, when and where that code would be executed.

Getting back to the original question, it's important to consider what a 'thread' actually is. A thread is a pure software abstraction for a logical sequence of events. Threads are often associated with a priority, a state (e.g. ready, waiting, zombie), and an instruction pointer. A threading abstraction (e.g. POSIX threads) must a) introduce data protection primitives, as well as mechanisms to b) wait until data is not in use and c) signal when data is no longer in use. Usually the operating system deals with scheduling which threads run at any given time, although it isn't that hard to do this without an operating system. The fundamental method of synchronizing threads is via shared sections of memory and atomic processor instructions. The thread scheduler uses timer-generated hardware interrupts to periodically evaluate the state of all threads, and then schedules code (i.e. determines the next branch target) for 1 to N cores. In the case of a uniprocessor system, this means that the scheduler itself is swapped in and out after a certain number of time slices, where each time slice is occupied by a thread chosen based on priority, state, etc. The number of cores available at any given time is also controllable in software, since cores can be dynamically powered off to save energy; this is something the thread scheduler must take into account.
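Those three ingredients (a data protection primitive, waiting, and the scheduler deciding where threads actually run) can be sketched in a few lines. This uses Python's threading module purely for illustration; the thread and iteration counts are arbitrary:

```python
import threading

counter = 0
lock = threading.Lock()   # data-protection primitive (mutual exclusion)

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:        # only one thread may touch the shared data at a time
            counter += 1

# The OS scheduler, not the programmer, decides when and on which core
# each of these threads actually runs.
threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()              # block until every thread has finished

print(counter)  # 40000; without the lock, updates could be lost
```

The same structure maps directly onto `pthread_create`, `pthread_mutex_lock`, and `pthread_join` in C.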

As for the initialization of each core: typically a single core is activated at power-on, and then, as the operating system (or main binary) launches, a threading manager is launched as well. The threading manager initializes and creates descriptive data structures for the remaining cores on the system, and so on. As each core runs, it literally operates in a loop: 'jump' to an instruction and start executing, or go to sleep if not needed; then do the same thing again. The details of power-up, particularly for the ARM architecture, are very manufacturer-dependent, since e.g. an OMAP MPCore implementation can differ physically (including in its register locations) from, e.g., an MSM MPCore implementation.

In summary: sure, it's easy to have several cores running at the same time, but getting them to coordinate shared data properly (i.e. run threads with shared data sections) requires that the concept of parallel execution be built into an application or library, which is not always easy. For a simple example, assume a library allocates 512 MB (i.e. 2**29 bytes) of memory, sets it all to zero, and then deallocates the memory. Would it run any faster on a multi-core system than on a single-core system? Absolutely not, because processor cores do not follow the programming methodology of DWIMNWIS ('do what I mean, not what I say'), unless the chip has a pretty advanced hardware rescheduler.

If I modify the library to first query a threading library for the number of logical cores, partition the buffer into N sections, and then create several threads that are aware of their own partition boundaries, then I can expect the library to perform faster by a factor given by Amdahl's law: S = 1/(F + (1-F)/N). In this case, since the fraction of the problem that is not parallelizable is 0%, F = 0 and the speedup is S = N. However, even when a thread scheduler is present, there is still variability in where the code will actually run; for example, all of the threads could end up on a single core rather than being distributed across all of them.

Amdahl's Law: S = 1 / ( F + (1-F)/N )
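Plugging a few values into that formula (F is the serial fraction of the work, N the core count) shows how quickly a little serial work erodes the ideal speedup:

```python
def amdahl_speedup(F, N):
    """S = 1 / (F + (1 - F) / N), for serial fraction F on N cores."""
    return 1.0 / (F + (1.0 - F) / N)

print(amdahl_speedup(0.0, 4))   # fully parallel: S = N = 4.0
print(amdahl_speedup(0.5, 4))   # half the work is serial: S = 1.6
print(amdahl_speedup(1.0, 4))   # fully serial: no speedup, S = 1.0
```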
Oddly enough, even though I have been writing threaded code for about a decade, I still have a pretty antiquated workstation on my desk (by today's standards). Indeed, my workstation is a single-core Pentium M laptop. It is a surprising hunk of garbage that never decides to finally die, although it's been close on several occasions. In any case, I hope to upgrade soon to a quad-core Intel i7 machine, so that I can have 8 logical threads to speed up my ultimate goal of world domination.


Also, I'm looking forward to receiving a PandaBoard shortly, with an OMAP4440 dual-core Cortex-A9 chip. This will give me incentive to do some SMP performance tweaks on some of the NEON-enabled software I've written lately (e.g. FFTW).

20100514

GSOC 2010 Redirection

Hi Folks,

I just thought that I would put something up here to redirect queries about my GSOC 2010 project entitled "Neon Support for FFTW". Since my primary blog (the perpetual notion) has several completely unrelated categories, I've set up a dedicated blog for this GSOC project to avoid any potential confusion.

So please visit gsoc2010-fftw-neon.blogspot.com for any GSOC-specific information.

20080912

BeagleBoard Notes

My BeagleBoard recently arrived from DigiKey and after resoldering an RS232 connector and downloading the binary images, I was good to go.

If you're planning on only using the BeagleBoard as a USB gadget connected to a PC, where the PC acts as the USB host, then you do not need to worry about a USB Mini-A cable.

Likewise, if you would like to network the BeagleBoard and a PC through the USB OTG port, a USB Mini-A cable is not necessary; you would use a regular USB Mini-B cable, the type used by most digital cameras.

20080905

Toolchain for the Neo FreeRunner

Well, it's been a while, hasn't it !?

Erin, Jules, & I were all quite busy over the last few weeks - we travelled all over the lower half of Ontario, and then went camping at Bruce Peninsula National Park & on Manitoulin Island.

(Note to self, make Google map containing a trail of our route + photos)

We're now back in Montreal, and it's pretty intense. I have 4 exams coming up in the next 3 weeks and will surely have my hands full for all of them. Hopefully the examination board at the Uni-Kiel allows me to write the exams remotely from Montreal, given my special circumstances (baby, f/t job, etc), so that I don't have an overly-demanding schedule when I return in March / April.

Anyway, I was pleased to receive my new Neo FreeRunner mobile phone when I got back to Montreal. Last week I upgraded the firmware to Om2008.8, which has a slick, WebKit-based UI. After having used it for a week, I can honestly say that this guy will be the iPhone killer, for anyone who likes to do extreme things with mobile devices at least.

My Eee PC is doing very well, and another one is on the way for Erin. I was hoping that it would get here in time for her birthday, but it seems that there is an 8-10 week shipping delay from the Royal Bank!!! Daaaamn!!!

In a few days, I'm expecting to receive a BeagleBoard in the mail along with a 15.1" touchscreen panel - I can't wait ;-) After that comes a CipherLab 9400 handheld for industrial scanning. I'll be putting Linux on that too.

I really have to say, though, that I'm really starting to feel the lack of a graphical package manager for Gentoo-based mobile devices. Seeing as it's now my job to implement a web-based, distributed (push-style) package management system, I don't think it will be very hard for me to implement a mobile (pull-style) graphical package manager.

There's another guy on the gentoo-embedded list from Portugal, named Ângelo (a.k.a. miknix), who is also aiming to do the same thing with the HTC Wizard (also a pretty sweet-looking handheld w/ integrated keyboard).

Anyway, if there are any Gentoo users out there who would like to download an i686-pc-linux-gnu -> armv4tl-softfloat-linux-gnueabi cross-toolchain suitable for OpenMoko cross-compilation, then check out my latest toolchain. Please don't forget to read the README file.

I guess the next cross toolchain I come up with will target the armv7a-c6x-linux-gnueabi BeagleBoard, which also happens to have Jazelle Technology (something I have wanted to experiment with for a while!).