Sunday, December 19, 2010

Mystery 5.8GB of app data on my iPad

So I go to sync my iPad for my short trip to Denver. I wanted to put a movie on it so I could watch something on the plane. Imagine my surprise when my iPad had these stats:

[screenshot: iTunes capacity stats for the iPad]

5.8GB of app data! Wow. Where did that come from? I go into the Apps tab and sort my apps by size:

[screenshot: apps sorted by size in iTunes]

The biggest app I have installed is 190MB (big, to be clear, but not 5.8GB big). If I add up all my apps, I get somewhere in the 500MB range. Where is all this space going? So I do some googling around and get nowhere fast. It turns out Apple has changed the way they do backups from version to version. Some say it should be under /private/Library, some say /private/var, etc... Anyway, I found my backups under /Users/rwhiffen/Library/Application Support/MobileSync/Backup. Under there are some cryptic directory names, likely hashes (they're 40 hex characters, so SHA-1 rather than MD5). The one I'm interested in is from today when I'm syncing my iPad: 6bf83d2961ea2206b4c08edb555b2b0d89c7f218. Inside there are 4083 more hashed file names. One in particular is quite large:

rwhiffen:6bf83d2961ea2206b4c08edb555b2b0d89c7f218 rwhiffen$ ls -lh 5180d2cec771957569b3dc0a8eed20b536fa9185
-rw-r--r-- 1 rwhiffen rwhiffen 4.2G Dec 14 00:22 5180d2cec771957569b3dc0a8eed20b536fa9185
rwhiffen:6bf83d2961ea2206b4c08edb555b2b0d89c7f218 rwhiffen$

So that's probably my problem app. Now I just need to figure out how to translate 5180d2cec771957569b3dc0a8eed20b536fa9185 into something meaningful. There are a few other files in there:

-rw-r--r-- 1 rwhiffen rwhiffen  93500 Dec 19 13:01 Info.plist
-rw-r--r-- 1 rwhiffen rwhiffen 660554 Dec 19 13:01 Manifest.mbdb
-rw-r--r-- 1 rwhiffen rwhiffen 126656 Dec 19 13:01 Manifest.mbdx
-rw-r--r-- 1 rwhiffen rwhiffen   7025 Dec 19 13:01 Manifest.plist
-rw-r--r-- 1 rwhiffen rwhiffen    189 Dec 19 13:01 Status.plist

Those probably have the data; I just need to figure out how to read them. Info.plist just has some interesting XML data about my iPad. Manifest.plist looked promising. It was in binary format, so I had to convert it to XML first: plutil -convert xml1 Manifest.plist (be sure you copy it to a temp location first and don't convert the original...). Manifest.plist turned out to be a bust too. Status.plist is just the status of the last backup. So no useful stuff there. Ugh.
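
If you'd rather poke at the plists without converting anything, you can also read them from a scripting language. Here's a minimal sketch in Python, assuming Python 3.4+ where the standard-library plistlib can load binary plists directly (so no plutil step, and the originals are never modified):

import plistlib

# Run this from inside the backup directory; it only reads the files.
for name in ("Info.plist", "Manifest.plist", "Status.plist"):
    with open(name, "rb") as f:
        data = plistlib.load(f)  # handles both XML and binary plists
    print("==", name, "==")
    for key in sorted(data):
        print("   ", key, "->", type(data[key]).__name__)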

 

So that means I need to try and slog through the mbdb and mbdx files. Yikes. Fortunately someone else has been there first: http://stackoverflow.com/questions/3085153/how-to-parse-the-manifest-mbdb-file-in-an-ios-4-0-itunes-backup The user Galloglass' solution worked best for me. I cut and pasted his Python code into a file called lsback.py and chmod'ed it 755.

rwhiffen:6bf83d2961ea2206b4c08edb555b2b0d89c7f218 rwhiffen$ ./lsback.py | grep 5180d2cec771957569b3dc0a8eed20b536fa9185
-rw-r--r-- 000001f5 000001f5 4479500672 1291642994 1291642994 1291592204 (5180d2cec771957569b3dc0a8eed20b536fa9185)AppDomain-com.polishedplay.puppetpals::Documents/ipad/recordings/new/audio
rwhiffen:6bf83d2961ea2206b4c08edb555b2b0d89c7f218 rwhiffen$

So it would seem that there is a stray audio recording for PuppetPals, one of my kids' games. So I deleted it, re-synced the iPad, then added the app back.

[screenshot: iPad capacity stats after the re-sync]

Now that's a more reasonable number. Still higher than I would have thought, but not 5.8GB.
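
Side note: the hashed file names themselves aren't random. They're commonly reported to be the SHA-1 of the domain name and the relative path joined by a dash, so you can sanity-check a single suspect file without parsing the whole manifest. A rough sketch in Python, using the domain and path that lsback.py reported above (treat the naming scheme as an assumption, not gospel):

import hashlib

# Reported backup naming scheme (assumption): SHA1("<domain>-<relative path>")
domain = "AppDomain-com.polishedplay.puppetpals"
path = "Documents/ipad/recordings/new/audio"

digest = hashlib.sha1((domain + "-" + path).encode("utf-8")).hexdigest()
print(digest)  # if the scheme holds, this matches 5180d2cec771957569b3dc0a8eed20b536fa9185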

 

Friday, November 19, 2010

More Splunk fun...

I've started setting up summary indexes: I take search results and put them in a second index for reporting. First you have to create the new index; mine's called "dashboard_summarize". It will require a restart of Splunk, just so you know. Next up, the ugly query:


search = host="srchqenmana*" (source="/usr/local/tvs/apache-tomcat/logs/qlogger/*" NOT source="*.gz") "<A9_Request" AND NOT ("FFFFFFFFFFFF" OR "000013ED3AEB" OR "Agent.007") | lookup Market_by_Controller_ID Controller_ID as Controller_ID OUTPUT Market as Market | eval QueryFirstTwo=substr(TextQuery,1,2) | transaction MAC, QueryFirstTwo maxspan=5m maxpause=1m delim="," mvlist=TextQuery | eval LastQuery=mvindex(TextQuery, -1) | fillnull value=0 forward | eval MAC="costtimequalityscope".MAC | eval MAC=md5(MAC)|stats count(LastQuery) as QueryCount by LastQuery, Market, Controller_ID, StreamingServerID, forward | fields QueryCount LastQuery Controller_ID StreamingServerID Market forward |collect addtime=true index=dashboard_summarize


Yikes! Let's break that down a bit. First up we have the sifting portion of the query, basically the search terms that rule data pieces in or out:

host="srchqenmana*" (source="/usr/local/tvs/apache-tomcat/logs/qlogger/*" NOT source="*.gz") "<A9_Request" AND NOT ("FFFFFFFFFFFF" OR "000013ED3AEB" OR "Agent.007")

Next up we have some data lookups. We take the numerical Controller_ID and map that to a human-readable market name like 'Salt Lake' or 'Bucks County'.

lookup Market_by_Controller_ID Controller_ID as Controller_ID OUTPUT Market as Market
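
For reference, the Market_by_Controller_ID lookup behind that command is just a CSV file with a header row matching the field names. Something like this (the IDs and pairings below are made up for illustration):

Controller_ID,Market
1001,Salt Lake
1002,Bucks County
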
Next we start doing calculations, conversions and transformations of the data. We'll go through this part stanza by stanza:

eval QueryFirstTwo=substr(TextQuery,1,2)

Eval a field called 'QueryFirstTwo' to the first two letters of the TextQuery string, using the substr function.

transaction MAC, QueryFirstTwo maxspan=5m maxpause=1m delim="," mvlist=TextQuery

This little gem is a beauty. I wish I could take credit for what the Splunk consultant did there. Basically we define what a single user search is by defining what a transaction is. We do not count just the simple submission of a request, because we do live updating of search results after two letters. So if you were searching for the show HOUSE, with live updating you would make a request for HO, HOU, HOUS, HOUSE at every key press. That's great if you're just measuring raw throughput, but it's not a valuable business data point. If everyone is searching for really long search terms like SUPERNATURAL, your usage stats would be skewed. So we roll those up into a single transaction by setting some parameters. First, we time box it at 5 minutes. Second, we only allow for a 1 minute pause. Sure, there are edge cases where you may exceed either of these time boundaries, but it should be a wash overall. Further, the MAC address and the first two letters of the search must also be the same. This lets us tolerate typos later on. So if you typed HOUU and then HOUS, because HO would match, it's still in the same transaction. And the last little bit, mvlist=TextQuery, says to make a multi-value field (an array) of the TextQuery values used in this transaction. In my example the list would be ("HO", "HOU", "HOUS", "HOUSE"). This comes up in our next stanza.
eval LastQuery=mvindex(TextQuery, -1)

If you look up mvindex and its syntax, you see that we're setting the field LastQuery to the last entry in the list. In my example, LastQuery=HOUSE. Side note: the page linked for mvindex is titled 'Common Eval Functions' according to the URL. I'd hate to see the uncommon ones.
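
If the transaction/mvindex pair still feels abstract, here's a rough Python sketch of the idea. This is not how Splunk implements transaction internally; it's just an illustration of collapsing per-keystroke events into one logical search keyed on (MAC, first two letters), with the same span and pause limits, and then keeping the last query:

# Illustrative only: mimic "transaction MAC, QueryFirstTwo maxspan=5m maxpause=1m
# mvlist=TextQuery | eval LastQuery=mvindex(TextQuery, -1)" in plain Python.
MAX_SPAN = 5 * 60    # the whole transaction must fit inside 5 minutes
MAX_PAUSE = 1 * 60   # no more than 1 minute between consecutive keystrokes

def group_searches(events):
    """events: (timestamp_seconds, mac, text_query) tuples, sorted by time."""
    open_txns = {}   # (mac, first_two_letters) -> {"start", "last", "queries"}
    finished = []
    for ts, mac, query in events:
        key = (mac, query[:2])
        txn = open_txns.get(key)
        if txn and ts - txn["last"] <= MAX_PAUSE and ts - txn["start"] <= MAX_SPAN:
            txn["queries"].append(query)
            txn["last"] = ts
        else:
            if txn:
                finished.append(txn)
            open_txns[key] = {"start": ts, "last": ts, "queries": [query]}
    finished.extend(open_txns.values())
    return finished

# Four keystroke events collapse into one transaction; the last entry
# ("HOUSE") is what mvindex(TextQuery, -1) would hand back as LastQuery.
events = [(0, "000013ED3AEB", "HO"), (2, "000013ED3AEB", "HOU"),
          (4, "000013ED3AEB", "HOUS"), (6, "000013ED3AEB", "HOUSE")]
for txn in group_searches(events):
    print(txn["queries"], "->", txn["queries"][-1])
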
fillnull value=0 forward | eval MAC="salted".MAC | eval MAC=md5(MAC)

I'm grouping the next three stanzas together because they're doing similar things. If the field named "forward" is null, set it to zero. Next we add a salt to the MAC address to obscure/anonymize it. The MAC (much like an IP address), while not directly identifying an individual, is sensitive just the same and needs to be hidden, so first we prepend the string salted to the current value of MAC. Think of this like a password or key. Next we convert the salt+MAC value to the MD5 hash of that string. So 000013ED3AEB becomes salted000013ED3AEB, which becomes ce431f1c1a634337ca1cdcde78a1d15f. Now if someone knows someone's MAC address and does echo -n "000013ED3AEB" | md5sum to try and figure out their new obscured value, they can't, because they don't know the salt. And because the salt can be of arbitrary length, brute force isn't effective. So it's reasonably protected if for some reason the data needs to be shared with non-trusted parties.
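
If you ever need to reproduce the obscured values outside of Splunk (to join against another data set, say), the equivalent is a couple of lines of Python. The "salted" string here is the same illustrative stand-in used above, not the real salt:

import hashlib

SALT = "salted"          # stand-in; the real salt is a secret, arbitrary-length string
mac = "000013ED3AEB"

# Same idea as: eval MAC="salted".MAC | eval MAC=md5(MAC)
obscured = hashlib.md5((SALT + mac).encode("ascii")).hexdigest()
print(obscured)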

stats count(LastQuery) as QueryCount by LastQuery, Market, Controller_ID, StreamingServerID, forward, MAC
This one is fairly straightforward: get the number of times the search term was searched, organized by Market (which we looked up in a table before), Controller_ID, StreamingServerID, and the value of forward (which are app-specific fields that only have meaning to us). The why of this is coming up.
fields QueryCount LastQuery Controller_ID StreamingServerID Market forward, MAC

Next we want to take the fields listed above and output them in the search results (the why comes next).
collect addtime=true index=dashboard_summarize

Lastly, we collect this data and store it in an index called 'dashboard_summarize'. What we're doing is making a roll-up of searches and weeding out all the cruft that isn't needed to make the reports or dashboards. Further, because we've scrubbed sensitive data, we can now let a larger audience view the data by giving them permissions to only this new index. Because the index is lean and mean, dashboards and reports are several orders of magnitude faster than going against the raw data. And we've pre-paid a lot of the calculation expense with the evals and transaction logic.
Now I have an index to do my reporting out of that's much faster than the raw queries against all the data.

Thursday, November 18, 2010

Working with Splunk

I've been doing a lot of work with Splunk lately. Splunk is a powerful and flexible indexing tool. It slurps up log files and data and makes them searchable. I think the real power of Splunk over a lot of other log management and searching tools is its ability to search across multiple servers for the same time period. Another powerful feature is its ability to do field extraction. So when a log file says "IP_Address=10.11.12.13" you can do field-related searches like "AND IP_Address=10.11.12.13" or, more powerfully, "NOT IP_Address=10.11.*"


Fields are where I'm spending a lot of my time lately. In our current search and discovery platform we have lots of fields with interesting values from people making search requests, such as channelmap, controllerID, MAC, TextQuery and a few others. Because we have these interesting field values and Splunk extracts them for us, we can generate very interesting usage reports, such as the number of unique users, users per market, etc. And because we have a relatively closed set of users, we can produce interesting numbers like the percentage of users actually using our platform. Powerful stuff.


Anyway, I hope to write up some of my more interesting uses of Splunk in the future.


Sunday, September 26, 2010

Ping gets more useful

I didn't get the point of Ping when Apple announced it. What good was it? I mean, even if you had all the artists in the iTunes 'verse online and updating, so what? Well, with the recent update to iTunes, it's starting to become more of what I thought it should be. They finally added some features that iLike.com and last.fm had all along.
iTunes 10.0.1 makes it easier to share your favorite music with your friends on Ping. You can now Like or Post about music right from your iTunes library. You can also easily see the recent activity of a selected artist in your library, or of all artists and friends you follow on Ping using the new Ping Sidebar.

So now it's more integrated into the core app; you don't have to visit the iTunes Store to view Ping info. It has a nice sidebar like iLike.com (a service I stopped using some time ago). You get a Facebook/Twitter-style timeline of what others bought, followed, commented on, etc. All of this is nothing new in the 'social media' world. Nor is it done innovatively or particularly well. I think it's too much of a 'me too' move by Apple. Simply putting Apple's brand and market presence behind it isn't enough. Google Wave? Microsoft Zune or Bing? The product still has to be good and useful. So far Ping seems to be neither to me. It also seems to be solving a problem nobody has. Facebook and Twitter tell me all about what my friends are up to. Do I want to go to yet another place to see what they're listening to?

Hopefully Apple will roll out a regular string of improvements to the service. In typical Apple fashion they're not rushing into this. They released the first, fairly crippled version a few weeks ago. They've already released the first update. With any luck they'll release another before the year's end. I'd like to see it incorporate the Genius suggestions in some way. It'd also be interesting to give out some kind of badge or award for listening and rating. They also need to improve the way you find people to follow and the suggestions they generate. Another interesting feature would be to suggest the 'mood' a person is in based on the music they listen to.

Now on to the wild speculation based on nothing except my wishful thinking. OK, suppose they get cool new features into Ping. So what? Is it enough to reach the tipping point? I doubt it. But what if it's part of a bigger plan? What if Ping goes beyond iTunes and takes the next logical step and gets integrated into iPods, iPads and iPhones? Now it's more than music. But that's not enough; I can already use Twitter and Facebook on those devices. What if it extends further to the AppleTV? Now it's about what I watched in addition to what I listened to. Now it's getting interesting. That's one niche that hasn't been filled by cable or FiOS. TiVo, Roku and Boxee are headed there, but they're one-dimensional. Watch a great TV show, then comment about it to all your friends. Even better if it could be done while watching. Ping your buddy while watching: 'Hey, I know you'd do exactly what Wolowitz did with the robot arm!' Now Ping becomes something more than a copycat app.

Any way you slice it, Apple has a lot of work ahead of them if they hope to turn Ping into another reason to use iTunes and the Apple ecosystem.

Tuesday, September 14, 2010

Need some new fitness gadgets...

So I have a GPS on my bike, an older 12-channel eTrex. It has its problems. It loses signal too often in the city, so my stats are off a bit (I was at an elevation of -5 feet for a mile or so today). So I'm looking for new gadgets to use for this.


So far I've come across the ANT+ system by Digifit, and I think it does exactly what I want. Since I use my iPod when I ride anyway, it's one less gadget to carry around. Amazon sells it for ~$80. Since ANT+ is a relatively open system, there are multiple vendors making gear for it:



  • Adidas sensors and devices (ANT+)
  • CycleOps sensors and devices (ANT+)
  • Garmin sensors and devices (ANT+)
  • Quarq sensors and devices (ANT+)
  • Spinning® / StarTrac
  • Tanita weight scales (ANT+)
  • Timex sensors and devices (ANT+)
  • Wahoo sensors and devices (ANT+)


Plus a host of others. So I can add a Garmin speed sensor and a heart rate monitor. If I go to the gym I can get the info from the StarTrac treadmills. Not sure I'll go as far as the Tanita scales though.


Boxee pre-order available (why would you?)

So the Boxee Box is now available for pre-order, according to the press release I was emailed:


D-Link has signed up Amazon http://amzn.to/theboxeeboxbydlink (in the US) and Best Buy http://www.bestbuy.ca/boxee / Future Shop http://www.futureshop.ca/boxee (in Canada) as exclusive pre-order partners for the Boxee Box.

 

The highlights they point out:


  • it will have access to more HD content than its PC cousin
  • no need for keyboard/mouse in the living room or running a 10ft cable to connect your laptop
  • it's beautiful, though a bit pointy in parts : )

 

Beauty is in the eye of the beholder, I guess; I don't care for it myself:

[image: the Boxee Box]

It's OK I guess, but it needs to sit next to your TV. It's only 4.5" x 4.5" x 4.6", so it's not too big; you could probably squeeze it on top of your DVR or cable box, but the area around it becomes unusable space for me. The one thing I loved: the remote. It has a basic 4-axis control pad with a 'select' button, a play/pause button and what looks to be a power button. That's all run-of-the-mill stuff. The cool stuff is on the flip side. Flip the remote over and it has a full keyboard. If you've ever tried to search on an AppleTV, TiVo or cable remote you know how huge this is. No word so far on whether it supports Hulu. The big thing for me is the price: $199 per unit. That's $100 more than the new Apple TV and $140 more than a Roku. Unless they have Hulu, I think the Boxee Box is DOA....

 

Friday, January 8, 2010

My kids will never have a 'must see TV' night

There was a time when Thursday nights were 'must see TV' nights. Friends, Seinfeld and ER made a pretty compelling night of 'must see TV', as the slogan went. A significant portion of the nation would be sharing the same experience on Thursday nights. I remember getting up early on a Saturday, even though it wasn't a school day, because that's when the good cartoons were on. I doubt my kids will ever have that notion or experience. Between DVR/TiVo, On Demand broadcasts, web delivery and AppleTV, there's not as big a driver to sit down at a scheduled day and time to watch.


For the past few years we only had terrestrial broadcast TV and an AppleTV in the house. One time, while watching Arthur on PBS, Renee had to go to the bathroom and was jumping up and down demanding we 'pause it!', not understanding that not all TV shows were like the AppleTV version. A few months back we added cable to the house and a DVR unit, so now we can pause live TV, further blurring the distinction between on-demand and on-schedule showings. She doesn't understand why she can't watch Zula Patrol any time she wants and has to wait until 7:30 to see it. She's convinced it's something that I'm not doing for her, and not a case of it not being available On Demand or via AppleTV. Further, there's only one episode to watch, and when it's done, it's done. With all her other shows there's always another episode, so she asks to watch another Zula Patrol, and I have to tell her no. From her perspective it's no different than saying she can only watch one Super Why. We have more; I'm just not allowing her to watch them. Except in this case, we legitimately don't have more to watch. Kind of works in my favor, I guess; there's no chance I'll cave in and let her watch a second one.


As more and more entertainment options become less tied to the provider's schedule and less tied to the TV as the only way to watch, the notion of a good night for TV will wither away. There will still be some notion of scheduling, but it'll be the date and time it's put on the distribution network. It probably won't be the same, though. Even for me it's not quite the same. I love CBS's Monday line-up, but I don't think I've watched any of those shows at their actual broadcast time in 2 or 3 years. I don't think TV is dead or going away. As Randall Hounsell put it, TV is still "a lean back experience." People will still want to get someplace comfy and be immersed in a world that isn't their own.


Back when CDs were the norm but vinyl records were still around, my over-used joke was that my kids were going to ask me, "Dad, how do we get this big black disc into the CD player?" Now I'm not so sure they'll even remember what a CD is. I never thought the same would happen for TV. The Qwest commercial from the late '90s is finally coming true.


A tired man goes into a cheap motel in the middle of nowhere and asks about amenities. When he asks about entertainment, the girl responds, "all rooms have every movie ever made, in any language, anytime, day or night." It'll probably be 20 years after the fact, but it's coming.