🇨🇴 @guerravis
🇺🇸 @duto_guerra

Why Computer Science?


I have super powers:
I know how to code


Let me tell you a story

Young boy

Small town in Colombia

Curious as b##p

Wins a Fulbright Scholarship

PhD in Maryland

Works in Silicon Valley

Returns to Colombia

So I want to buy a car that I can resell in 2 or so years and not lose too much money 🤷🏼‍♂️

What car should I buy?

Normal procedure

Ask friends and family

Renault 4
Renault 4 JP4
Partially folded Renault 4 by the roadside

Problem

That's inferring statistics from a sample of n = 1

Better approach

Data-based decisions

Screenshot Tucarro.com
https://tucarro.com

Presidential Elections

Twitter Influentials

Twitter election analysis

https://public.tableau.com/app/profile/john.alexis.guerra.g.mez/viz/AnlisisPosiblesRobotsEleccionesColombiaMay25/AnlisisEleccionesPresidencialesColombia

Visualization?

The purpose of visualization is insight, not pictures

Defining Information Visualization (vis)

"Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively."
Tamara Munzner

Why?


A good visualization enables users to complete tasks effectively on the data.

But what are insights?

  • Deep understanding
  • Meaningful
  • Non-obvious
  • Actionable
  • Based on data

Information Visualization

Why should we visualize?

      I            II            III            IV
    x      y      x      y      x      y      x      y
 10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
  8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
 13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
  9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
 11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
 14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
  6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
  4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
 12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
  7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
  5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89

Property                                                 Value
Mean of x                                                9
Variance of x                                            11
Mean of y                                                7.50
Variance of y                                            4.125
Correlation between x and y                              0.816
Linear regression                                        y = 3.00 + 0.500x
Coefficient of determination of the linear regression    0.67
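
To check the summary properties listed above, here is a minimal Python sketch that recomputes them for each of the four datasets. The names `quartet` and `summarize` are made up for this example, and the numbers are copied from the table. All four datasets should land on nearly the same values, which is exactly why you have to plot them to see how different they are.

```python
from statistics import mean, variance

# The four datasets, copied from the table above.
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8.0] * 7 + [19.0] + [8.0] * 3,
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return mean/variance of x and y, Pearson r, and the least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(xs),  # sample variance (divides by n - 1)
        "mean_y": my,
        "var_y": variance(ys),
        "r": sxy / (sxx * syy) ** 0.5,
        "intercept": my - slope * mx,
        "slope": slope,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```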

https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/

Datasaurus!


https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/

In Infovis we look for Insights

  • Deep understanding
  • Meaningful
  • Non-obvious
  • Actionable
  • Based on data

How do I do it?

What do I use?

Visualization Science

Problem Abstraction

What/Why/How

  • What is visualized?
    • data abstraction
  • Why is the user looking at it?
    • task abstraction
  • How is it visualized?
    • idiom: visual encoding and interaction

Abstract language avoids domain-specific pitfalls
What/Why/How helps navigate the design space systematically

Marks and Channels

Analyze Idiom Structure

Marks

Point
Line
Area

Channels

Channels

Channel Types

Channels

How do our Senators vote?

https://johnguerra.co/viz/senadoColombia/

Big Data?

You might have heard of the Vs of Big Data

  • Volume
  • Velocity
  • Variety
  • and Veracity and Value

Too ambiguous!! 🤦🏽‍♀️ Let's go beyond that

How Big is big?

Can you fit it in one computer?

Yes? 👉🏼 Then it is not really big 🤷🏽‍♀️

Why this criterion?

Big data 👉🏼 Big overhead

Example: photo collection

  • One photo 👉🏼 10MB
  • 1k photos in a 📱 👉🏼 10MB * 1k = 10000MB = 10GB
  • 50k photos in your 💻 👉🏼 10MB * 50k = 500GB
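
The same back-of-the-envelope arithmetic as a small Python sketch. The helper names and the 1 TB disk are assumptions for illustration. The 10MB photo size and the 1 GB = 1000 MB conversion come from the slide.

```python
PHOTO_MB = 10  # size of one photo, as assumed on the slide


def collection_size_gb(num_photos, photo_mb=PHOTO_MB):
    """Total size in GB, using the slide's round conversion of 1 GB = 1000 MB."""
    return num_photos * photo_mb / 1000


def fits_on_one_computer(num_photos, disk_gb):
    """Hypothetical helper: does the whole collection fit on a single disk?"""
    return collection_size_gb(num_photos) <= disk_gb


phone_gb = collection_size_gb(1_000)
laptop_gb = collection_size_gb(50_000)
print(phone_gb, laptop_gb)
# The 1 TB (1000 GB) disk below is an assumed figure, not one from the talk.
print(fits_on_one_computer(50_000, disk_gb=1_000))
```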

Big Data? 🙅🏽‍♂️

How many blue photos are in my collection?

How do you compute this?

  • Put all your photos on one 💻
  • Go through all the collection and count the blue ones

Flickr scale

80+ trillion photos (80,000,000,000,000)

That's big data

How many blue photos are on Flickr?

How do you compute this?

  • Distribute the data among hundreds of 💻💻💻 (a cluster).
  • Compute subtotals on each data part. (Map)
  • Aggregate the subtotals into one big total. (Reduce)
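
The three steps above can be sketched on a single machine in plain Python. This is only an illustration of the Map/Reduce idea, not how Hadoop or Spark are actually invoked. The tiny `photos` list, the `is_blue` rule and the helper names are all assumptions made up for the example.

```python
from functools import reduce

# Hypothetical stand-in for a photo: its average (red, green, blue) color.
photos = [(20, 40, 200), (210, 30, 30), (10, 90, 220), (120, 120, 120),
          (0, 0, 255), (250, 240, 10), (30, 60, 180)]


def is_blue(photo):
    """Assumed rule: a photo is 'blue' when blue is its strongest channel."""
    red, green, blue = photo
    return blue > red and blue > green


def split(items, num_parts):
    """Distribute the items round-robin among num_parts 'machines'."""
    return [items[i::num_parts] for i in range(num_parts)]


def map_count_blue(part):
    """Map step: each machine counts the blue photos in its own part."""
    return sum(1 for photo in part if is_blue(photo))


parts = split(photos, num_parts=3)
subtotals = [map_count_blue(part) for part in parts]   # would run in parallel
total_blue = reduce(lambda a, b: a + b, subtotals, 0)  # Reduce step
print(subtotals, total_blue)
```

On a real cluster each part would live on a different machine, and the framework would also have to handle a machine failing halfway through.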

How many computers do you need?

What if one computer breaks? ☢️

Conclusion

Big Data? 👉🏼 Only if it doesn't fit on one 💻

⚠️ Use it only if you must ⚠️

But don't panic!

Let me share a secret

🤫

My wife tells it to me all the time!

Size doesn't really matter

What matters are the insights 👍

Insights?

Machine Learning?

Machine Learning?

Classical programming: data + rules = answers. Machine learning: data + answers = rules.
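
A toy Python sketch of that contrast. In the first function a person writes the rule. In the second, the rule (a threshold) is derived from labeled examples. The data, the function names and the brute-force search are assumptions for illustration. Real ML libraries are far more sophisticated.

```python
def classify_with_rule(value, threshold):
    """Classical programming: data + a hand-written rule give the answer."""
    return value >= threshold


def learn_threshold(values, answers):
    """Machine learning (toy version): data + answers give the rule.

    Tries every midpoint between neighboring sorted values and keeps the
    threshold that misclassifies the fewest labeled examples.
    """
    candidates = sorted(values)
    midpoints = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]

    def errors(threshold):
        return sum((v >= threshold) != a for v, a in zip(values, answers))

    return min(midpoints, key=errors)


# Hypothetical data: blue-channel intensity of a photo, and whether a person
# labeled that photo as "blue".
blue_intensity = [12, 40, 95, 130, 180, 220, 250]
labeled_blue = [False, False, False, False, True, True, True]

learned = learn_threshold(blue_intensity, labeled_blue)
print(learned, classify_with_rule(200, learned), classify_with_rule(60, learned))
```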

What can you use ML for?

  • Photos 🖼
  • Videos 📹
  • Document/Text Processing 📃
  • Speech 👄👂🏼
  • Structured data 💾?

What can I detect in photos 🖼?

  • Objects 🐈 🐕 🍎
  • Faces 👱🏽‍♂️👱‍♀️
  • Celebrities 🍾
  • Landmarks 🗼
  • Text in images 🗼

Video 📹 is about the same, but on a stream

How can I use it?

Develop locally

Pose Detection

https://johnguerra.co/viz/mlPose/

Object Detection

https://johnguerra.co/viz/mlObject/

How can I use it?

Demos

What can I do with documents 📃?

  • OCR 🖼 → 🔤
  • Sentiment analysis 😆😡
  • Topic extraction 🟡🟠🟣
  • Entity detection
  • Political Affiliation? 👔🎉
  • Psychological Profile?

Demos

What can I do with Speech 👄👂🏼?

  • Speech recognition 👂🏼
  • Speech generation 👄

That's hip, but...

The purpose of visualization is insight, not pictures

The purpose of data analytics is insight, not (just) models

Machine Learning

  • Prediction vs Training
  • How was it trained?
  • Garbage in, garbage out

ML vs InfoVis

How is Rappi doing on Twitter?

  • 30k tweets in a week of 2019

Approach 1

😡😠😢😐😁😃🥰?

  • Machine learning 🎩! ???
  • Detects sentiment! ???

I hired a data 🐒 (might be me)

Analyzed 180 tweets

  • 😡😠😢😐😁😃🥰

Here are some of them

Rappi tweet
😐 -10%
Rappi tweet
😡 -80%
Rappi tweet
🥰 80%
Rappi tweet
😐 -10%
Rappi tweet
😐 -20%
Rappi tweet
🥰 90%
Rappi tweet
😢 -40%
Rappi tweet
😢 -30%

Would you hire this data πŸ’?

Well, actually

  • It wasn't a data 🐒
  • It was a 💻
  • Would you use it?

Well, actually, actually

Will you trust it?

I don't

Approach 2

Approach 3

Explore the tweets on your own

It's up to you!

  • Interactivity 👉 Ask questions
  • Slice and dice
  • Overview first, Zoom/Filter, then details on demand
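
A minimal Python sketch of "overview first, zoom/filter, then details on demand" applied to a handful of tweets. The tweets, field names, scores and cutoffs are made up for the example. A real dashboard would do the same steps interactively.

```python
from collections import Counter

# Hypothetical tweets: the field names, texts and scores are made up.
tweets = [
    {"user": "ana", "text": "Order arrived fast", "sentiment": 0.8},
    {"user": "luis", "text": "Still waiting for my order", "sentiment": -0.4},
    {"user": "sofia", "text": "Support never answered", "sentiment": -0.8},
    {"user": "juan", "text": "Ok service", "sentiment": -0.1},
    {"user": "ana", "text": "Great discount today", "sentiment": 0.9},
]


def bucket(score):
    """Assumed cutoffs: below -0.3 is negative, above 0.3 is positive."""
    if score < -0.3:
        return "negative"
    if score > 0.3:
        return "positive"
    return "neutral"


def overview(items):
    """Overview first: how many tweets fall in each sentiment bucket?"""
    return Counter(bucket(t["sentiment"]) for t in items)


def zoom_filter(items, predicate):
    """Zoom/Filter: keep only the tweets that answer the current question."""
    return [t for t in items if predicate(t)]


def details(item):
    """Details on demand: show one tweet in full."""
    return f'@{item["user"]} ({item["sentiment"]:+.0%}): {item["text"]}'


summary = overview(tweets)
negative = zoom_filter(tweets, lambda t: bucket(t["sentiment"]) == "negative")
print(summary)
for tweet in negative:
    print(details(tweet))
```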

Rappi Dashboard Link 😉

Don't eat Machine Learning, eat 🍖!

Bonus

Types of Visualization

  • Infographics
  • Scientific Visualization (sciviz)
  • Information Visualization (infovis, datavis)

Infographics

Scientific Visualization

  • Inherently spatial
  • 2D and 3D

Information Visualization

Visualization Mantra

  • Overview first
  • Zoom and Filter
  • Details on Demand

Data Types

  • 1-D Linear: Document Lens, SeeSoft, Info Mural
  • 2-D Map: GIS, ArcView, PageMaker, Medical imagery
  • 3-D World: CAD, Medical, Molecules, Architecture
  • Multi-Var: Spotfire, Tableau, GGobi, TableLens, ParCoords
  • Temporal: LifeLines, TimeSearcher, Palantir, DataMontage, LifeFlow
  • Tree: Cone/Cam/Hyperbolic, SpaceTree, Treemap, Treeversity
  • Network: Gephi, NodeXL, Sigmajs

Take home messages

  • Data Analytics is way more than just models
  • Focus on insights!!!
  • Infovis: Choose the best marks and channels

What's the best school for my nephew?

Colombian High Schools
https://johnguerra.co/viz/saber11/

Remember

  • 👉🏼 Insights! 👈🏼
  • Users, tasks and data

John Alexis Guerra Gómez

johnguerra.co
@duto_guerra

Other Insights

FDA

Task: Changes in drugs' adverse effect reports

User: FDA Analysts

State of the art

https://treeversity.cattlab.umd.edu/

Health insurance claims

Task: Detect fraud networks

User: Undisclosed Analysts

Clustering

Overview

Ego distance

Who to follow on Twitter

https://johnguerra.co/slides/untanglingTheHairball/#/

Who am I?

PhD

Silicon Valley

Many other projects

Big Data Technologies

Technologies

  • MapReduce (Hadoop, Hive, Pig, Spark ...)
  • NoSQL Databases (Redis, Cassandra, MongoDB, Neo4J)
  • Distributed Relational (SQL) Databases (MySQL, PostgreSQL, Oracle, SQL Server)
  • Many others

Hadoop

  • Computing platform for big data
  • Uses clusters for storing and processing the data

Hadoop Architecture

Spark

A distributed computing alternative to MapReduce.

  • Easier to use
  • Integrates better with traditional programming models

NoSQL Databases

  • Scalable storage platforms that use techniques different from those of traditional SQL databases
  • They sacrifice features for performance

Types of NoSQL

  • Column Oriented: Cassandra, HBase, Redshift ...
  • Key-value: Redis, memcached, Aerospike ...
  • Document based: MongoDB, CouchDB, DynamoDB ...
  • Graph based: Neo4J, Titan, ...

Bonus

Introduction to NoSQL for Web Developers

Distributed Relational DB

  • You can also use traditional databases in a distributed way.
  • The database is divided into shards.
  • Usually doesn't scale that well.
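
A minimal Python sketch of the sharding idea, using in-memory dictionaries in place of real database servers. `ShardedTable`, `shard_for` and the license-plate keys are made up for the example. Hashing the key with CRC32 is one assumed routing strategy among several.

```python
import zlib


def shard_for(key, num_shards):
    """Map a key to a shard with CRC32, which is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class ShardedTable:
    """Toy table whose rows are spread over several in-memory 'databases'."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, row):
        self.shards[shard_for(key, len(self.shards))][key] = row

    def get(self, key):
        # A lookup by key touches exactly one shard.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def count_rows(self):
        # A query over all rows must visit every shard and combine the results.
        return sum(len(shard) for shard in self.shards)


table = ShardedTable(num_shards=4)
for plate in ["ABC123", "XYZ789", "JKL456", "QWE321", "RTY654"]:
    table.put(plate, {"plate": plate})

print(table.get("XYZ789"), table.count_rows())
print([len(shard) for shard in table.shards])
```

Lookups by key stay cheap, but anything that spans shards (joins, global counts) needs extra coordination, which is one reason this approach often scales less gracefully.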

Others

  • Google Dataflow
    • Google's replacement for MapReduce based on flows.
    • Supposed to scale better.
    • AFAIK can only be used with Google's Cloud.

How to make sense of data?

  • Statistical Analysis
  • Machine Learning and Artificial Intelligence
  • Visual Analytics (and data analytics)

Visual Analytics

Traditional

  • Query for known patterns
  • Display results using traditional techniques

Pros:
  • Many solutions
  • Easier to implement

Cons:
  • Can't search for the unexpected

Data Mining/ML

  • Based on statistics
  • Black box approach
  • Output outliers and correlations
  • Human out of the loop

Pros:
  • Scalable

Cons:
  • Analysts have to make sense of the results
  • Makes assumptions on the data

InfoVis

  • Visual Interactive Interfaces
  • Human in the loop

Pros:
  • Visual bandwidth is enormous
  • Experts decide what to search for
  • Identify unknown patterns and errors in the data

Cons:
  • Scalability can be an issue