Intro to Big Data

John Alexis Guerra Gómez
@duto_guerra

http://infovis.co/introToBigData

Outline

  1. What is Big Data?
  2. How to process/store it?
  3. How to make sense of it?

What is Big Data?

You might have heard of the Vs of Big Data

  • Volume
  • Velocity
  • Variety
  • and Veracity and Value
  • Too ambiguous!! Let's go beyond that

How Big is big?

Can you fit it in one computer?

Yes? -> Then it is not really big

Let's call it big data only if it doesn't fit on one computer (and has the 3Vs)

Why this criterion?

Because if it fits in one computer, you don't need the overhead of big data technologies; a traditional relational database is enough.

Example: photo collection

  • One photo -> 10MB
  • 1k photos in a cellphone -> 10MB * 1k = 10000MB = 10GB
  • 50k photos in your computer -> 10MB * 50k = 500GB
  • Is that big data?
  • No, you can fit that in one cheap external hard drive
  • BTW, Flickr gives you 1000GB (1TB) for free
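
The arithmetic above can be checked with a small sketch. The 10MB average photo size and the decimal units come from the example; the 1TB external drive is a hypothetical capacity chosen only for illustration.

```python
# Back-of-the-envelope storage estimate, using the round numbers of the example.
MB_PER_PHOTO = 10   # assumed average photo size, from the example
MB_PER_GB = 1000    # decimal units, as in the example (10000MB = 10GB)


def collection_size_gb(num_photos: int, mb_per_photo: int = MB_PER_PHOTO) -> float:
    """Return the total size of a photo collection in gigabytes."""
    return num_photos * mb_per_photo / MB_PER_GB


phone_gb = collection_size_gb(1_000)       # 1k photos on a cellphone
computer_gb = collection_size_gb(50_000)   # 50k photos on a computer

# Hypothetical capacity of one cheap external hard drive (1TB).
EXTERNAL_DRIVE_GB = 1000
fits_on_one_drive = computer_gb <= EXTERNAL_DRIVE_GB

print(phone_gb, computer_gb, fits_on_one_drive)
```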

Problem: count how many blue photos in my collection?

How do you compute this?

Put all your photos in one computer

Go through all the collection and count
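
A minimal single-machine sketch of that loop follows. The slides do not define what makes a photo "blue", so the rule in `is_blue` is a made-up assumption, and each photo is represented as a plain list of RGB tuples rather than a real image file.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue), each 0-255
Photo = List[Pixel]            # simplified stand-in for a decoded image


def is_blue(photo: Photo) -> bool:
    """Hypothetical rule: a photo is 'blue' if its average blue channel
    is larger than both its average red and average green channels."""
    n = len(photo)
    if n == 0:
        return False
    avg_r = sum(p[0] for p in photo) / n
    avg_g = sum(p[1] for p in photo) / n
    avg_b = sum(p[2] for p in photo) / n
    return avg_b > avg_r and avg_b > avg_g


def count_blue(collection: List[Photo]) -> int:
    """Go through the whole collection on one machine and count."""
    return sum(1 for photo in collection if is_blue(photo))


# Tiny made-up collection: two pixels per photo.
sky = [(30, 80, 200), (40, 90, 220)]
grass = [(40, 180, 60), (50, 200, 70)]
sunset = [(230, 120, 40), (210, 100, 60)]
collection = [sky, grass, sunset, sky]

blue_count = count_blue(collection)
print(blue_count)
```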

Flickr size

80+ trillion photos (80,000,000,000,000)

That's big data

How many blue photos on Flickr?

How do you compute this?

Distribute the data among hundreds of thousands of computers (a cluster).

Compute subtotals on each chunk of the data. (Map)

Aggregate the subtotals into one big total. (Reduce)
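
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not Hadoop's API: the chunks live in one process and are mapped sequentially, whereas a real cluster would run the map step on different machines in parallel. Each photo is reduced to a boolean label from some hypothetical "is it blue?" classifier.

```python
from functools import reduce
from typing import List


def split_into_chunks(items: List[bool], num_chunks: int) -> List[List[bool]]:
    """Distribute the data: deal the items round-robin into num_chunks chunks."""
    return [items[i::num_chunks] for i in range(num_chunks)]


def map_count_blue(chunk: List[bool]) -> int:
    """Map: compute the subtotal of blue photos in one chunk."""
    return sum(1 for photo_is_blue in chunk if photo_is_blue)


def reduce_totals(left: int, right: int) -> int:
    """Reduce: combine two subtotals into one."""
    return left + right


# Made-up labels: True means the photo was classified as blue.
photos = [True, False, True, True, False, False, True, False, True, False]

chunks = split_into_chunks(photos, num_chunks=3)      # one chunk per "computer"
subtotals = [map_count_blue(chunk) for chunk in chunks]
total = reduce(reduce_totals, subtotals, 0)

print(subtotals, total)
```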

How many computers do you need?

Total data size / capacity of one computer?

What if one computer breaks down?

We need redundancy -> Each photo is stored in many computers
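
A rough sizing sketch, combining the division above with redundancy. All figures here (1000TB of data, 4TB per machine, 3 copies of each photo) are hypothetical values chosen for illustration, and the estimate ignores overheads such as operating system space or free-space headroom.

```python
import math


def machines_needed(total_tb: float, capacity_tb: float, replication: int = 1) -> int:
    """Estimate the number of machines needed to store total_tb of data
    when every item is stored `replication` times."""
    return math.ceil(total_tb * replication / capacity_tb)


# Hypothetical numbers, for illustration only.
without_redundancy = machines_needed(total_tb=1000, capacity_tb=4)
with_redundancy = machines_needed(total_tb=1000, capacity_tb=4, replication=3)

print(without_redundancy, with_redundancy)
```

Storing each photo three times triples the number of machines, which is the price paid for surviving the loss of one computer.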

How do we control versions? How to keep records? What goes where?

That's why we need big data!!

Technologies

  • MapReduce (Hadoop, Hive, Pig, Spark ...)
  • NoSQL Databases (Redis, Cassandra, MongoDB, Neo4J)
  • Distributed Relational (SQL) Databases (MySQL, PostgreSQL, Oracle, SQL Server)
  • Many others

Hadoop

  • Computing platform for big data
  • Uses clusters for storing and processing the data

Hadoop Architecture

Spark

A distributed computing alternative to MapReduce.

  • Easier to use
  • Integrates better with traditional programming models

NoSQL Databases

  • Scalable storage platforms that use techniques different from traditional SQL databases
  • Sacrifice features for performance

Types of NoSQL

  • Column Oriented: Cassandra, HBase, Redshift ...
  • Key-value: Redis, memcached, Aerospike ...
  • Document based: MongoDB, CouchDB, DynamoDB ...
  • Graph based: Neo4J, Titan, ...
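
To make the four types more concrete, here is a sketch of how the same made-up photo data might be shaped under each model, using only plain Python structures. None of this is the API of the products listed above; the record fields and names are hypothetical.

```python
import json

# Key-value: the store only sees an opaque value looked up by its key.
key_value_store = {"photo:42": '{"owner": "ana", "tags": ["sky", "blue"]}'}
owner = json.loads(key_value_store["photo:42"])["owner"]

# Document based: nested records whose fields can be queried.
document_store = {
    "photos": [
        {"_id": 42, "owner": "ana", "tags": ["sky", "blue"]},
        {"_id": 43, "owner": "luis", "tags": ["park"]},
    ]
}
blue_ids = [doc["_id"] for doc in document_store["photos"] if "blue" in doc["tags"]]

# Column oriented: values are grouped by column, which suits aggregations.
column_store = {"id": [42, 43], "owner": ["ana", "luis"], "size_mb": [10, 12]}
total_size_mb = sum(column_store["size_mb"])

# Graph based: nodes plus labeled edges, queried by following relationships.
graph_nodes = {"ana": {"type": "user"}, "photo42": {"type": "photo"}}
graph_edges = [("ana", "UPLOADED", "photo42")]
uploaded_by_ana = [dst for src, rel, dst in graph_edges
                   if src == "ana" and rel == "UPLOADED"]

print(owner, blue_ids, total_size_mb, uploaded_by_ana)
```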

Bonus

Introduction to NoSQL for Web Developers

Distributed Relational DB

  • You can also use traditional databases in a distributed way.
  • Divide the database into shards.
  • Usually doesn't scale that well.
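
One common way to decide which shard holds a row is to hash its key. The sketch below assumes hash-based routing with a fixed number of shards; the user names and the shard count are hypothetical, and real databases may use other schemes such as range-based sharding.

```python
import hashlib
from typing import Dict, List


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard with a stable hash (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


NUM_SHARDS = 4  # hypothetical number of database servers
shards: Dict[int, List[str]] = {i: [] for i in range(NUM_SHARDS)}

for user_id in ["ana", "luis", "maria", "john", "sofia", "pedro"]:
    shards[shard_for(user_id, NUM_SHARDS)].append(user_id)

print(shards)
```

With this modulo scheme, changing the number of shards remaps most keys, which is one of the reasons resharding a relational database tends to be painful.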

Others

  • Google DataFlow
  • Google's replacement for MapReduce based on flows.
  • Supposed to scale better.
  • AFAIK can only be used with Google's Cloud.

How to make sense of it?

  • Statistical Analysis
  • Machine Learning and Artificial Intelligence
  • Visual Analytics (and data analytics)

Data Mining/Machine Learning

Information Visualization

Infovis + Algorithms

Traditional

  • Query for known patterns
  • Display results using traditional techniques

Pros:
  • Many solutions
  • Easier to implement

Cons:
  • Can’t search for the unexpected

Data Mining/ML

  • Based on statistics
  • Black box approach
  • Output outliers and correlations
  • Human out of the loop

Pros:
  • Scalable

Cons:
  • Analysts have to make sense of the results
  • Makes assumptions on the data

InfoVis

  • Visual Interactive Interfaces
  • Human in the loop

Pros:
  • Visual bandwidth is enormous
  • Experts decide what to search for
  • Identify unknown patterns and errors in the data

Cons:
  • Scalability can be an issue

Why should we visualize?

Anscombe's quartet
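
Anscombe's quartet is a set of four small datasets whose summary statistics are nearly identical even though they look completely different when plotted. The sketch below computes the mean of x, the mean of y and the Pearson correlation for each dataset; the values are the published quartet, and the helper `pearson` is written out by hand here rather than taken from a library.

```python
from math import sqrt
from statistics import mean

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# name -> (mean of x, mean of y, correlation)
summary = {name: (mean(xs), mean(ys), pearson(xs, ys))
           for name, (xs, ys) in quartet.items()}

for name, (mx, my, r) in summary.items():
    print(name, round(mx, 2), round(my, 2), round(r, 2))
```

All four rows come out almost the same, which is why the numbers alone cannot tell the datasets apart.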

Anscombe's visualized

In Infovis we look for insights

  • Deep understanding
  • Meaningful
  • Non-obvious
  • Actionable

Types of Visualization

  • Infographics
  • Scientific Visualization (sciviz)
  • Information Visualization (infovis, datavis)

Infographics

Scientific Visualization

  • Inherently spatial
  • 2D and 3D

Information Visualization

Infovis Basics

Visualization Mantra

  • Overview first
  • Zoom and Filter
  • Details on Demand
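
In data terms, the three steps of the mantra map roughly onto an aggregate, a filter and a lookup. The sketch below shows that mapping on a made-up list of photo records; a real visualization would draw each result instead of printing it.

```python
from collections import Counter

# Hypothetical photo records, for illustration only.
photos = [
    {"id": 1, "color": "blue", "year": 2014, "title": "Sky"},
    {"id": 2, "color": "green", "year": 2015, "title": "Park"},
    {"id": 3, "color": "blue", "year": 2015, "title": "Sea"},
    {"id": 4, "color": "red", "year": 2015, "title": "Sunset"},
]

# Overview first: summarize the whole collection (photos per color).
overview = Counter(photo["color"] for photo in photos)

# Zoom and filter: narrow down to the subset of interest.
filtered = [photo for photo in photos
            if photo["color"] == "blue" and photo["year"] == 2015]

# Details on demand: show the full record only for the selected item.
selected_id = 3
details = next(photo for photo in photos if photo["id"] == selected_id)

print(overview, filtered, details)
```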

Perception Preference

Adapted from: Tamara Munzner's book chapter

Data Types

  • 1-D Linear: Document Lens, SeeSoft, Info Mural
  • 2-D Map: GIS, ArcView, PageMaker, Medical imagery
  • 3-D World: CAD, Medical, Molecules, Architecture
  • Multi-Var: Spotfire, Tableau, GGobi, TableLens, ParCoords
  • Temporal: LifeLines, TimeSearcher, Palantir, DataMontage, LifeFlow
  • Tree: Cone/Cam/Hyperbolic, SpaceTree, Treemap, Treeversity
  • Network: Gephi, NodeXL, Sigmajs

Take home message

  • Big data? Sure, if it doesn't fit on one computer
  • How to process it? MapReduce, Spark, ...
  • How to store it? HDFS, NoSQL, Distributed RDBMS
  • How to make sense of it? Statistics, ML, Visual Analytics

Thank You

Questions?

John Alexis Guerra Gómez

johnguerra.co
@duto_guerra