Sunday, August 26, 2018

2. Design


    1. Understand the context - this means knowing your audience and conveying a clear message about what you want your audience to know or do with the information you are providing.
    2. Choose an appropriate visual display - this was covered in the last lesson. Check out the lesson titled recap in the previous section if you need a quick refresher. 
    3. Eliminate clutter - you should only provide information to the user that helps convey your message.
    4. Focus attention where you want it - build visualizations that pull attention to the message you want to highlight.  
    5. Think like a designer - you will learn a number of design principles in this lesson to assist as you start to put together your own data visualizations.
    6. Tell a story - your visualizations should give the audience a story. The most powerful data visualizations move people to take action.

    There are two main reasons for creating visuals using data:
    1. Exploratory Analysis is done when you are searching for insights. These visualizations don't need to be perfect. You are using plots to find insights, but they don't need to be aesthetically appealing. You are the consumer, and you need to be able to find the answer to your question from these plots.
    2. Explanatory Analysis is done when you are providing your results for others. These visualizations need to provide you the emphasis you need to convey your message. They should be accurate, insightful, and visually appealing. 

    The five steps of the data analysis process:
    1. Extract - Obtain the data from a spreadsheet, SQL, the web, etc.
    2. Clean - Here we could use exploratory visuals.
    3. Explore - Here we use exploratory visuals.
    4. Analyze - Here we might use either exploratory or explanatory visuals.
    5. Share - Here is where explanatory visuals live.

    Bad Visualization
    1. Don’t convey the message.
    2. Are misleading

    Chart Junk
    • Examples of chart junks


    Data Ink Ratio
    = amount of ink used to describe the data/ amount of ink used to describe everything else

    The data-ink ratio, credited to Edward Tufte, is directly related to the idea of chart junk. The more of the ink in your visual that is related to conveying the message in the data, the better.
    Limiting chart junk increases the data-ink ratio.


    • Graph with the high data-ink ratio is much easier to understand as it reduces down to a graph much simpler to understand by extracting the extraneous elements such as the background color.
    • Below chart also shows: low vs. high data-ink ratio



    Design Integrity




    • If the lie factor is above 1, the data is considered to be misleading.

    Using Color
    1. Before adding color to a visualization, start with black and white.
    2. When using color, use less intense colors - not all the colors of the rainbow, which is the default in many software applications. - use less intense colors(pastel, grey …etc.)
    3. Color for communication. Use color to highlight your message and separate groups of interest. Don't add color just to have color in your visualization.

    Designing for Color Blindness
    • Use color pallets that do not move from red to green. Instead, use colors on a blue to orange pallet.

    Shape, Size and Other Tools
    • Categorical: color and shape
    • Qualitative: size of marker

    Bad Visuals can be avoided by:
    1. Maintaining a large data-ink ratio and removing unnecessary items from visuals.
    2. Choosing visual encodings that work to highlight insights.
    3. Maintaining data integrity in the visual.

    For explanatory visuals:
    1. Focus the audience's attention on the insight you want them to act on.
    2. Use color only when necessary. Simple is often better.
    Focus on SIMPLICITY!!
    1. Tell a story.

    Tell a story
    1. Start with a Question
    2. Repetition is a Good Thing
    3. Highlight the Answer
    4. Call Your Audience To Action

    Recap
    1. How to build data visualizations for explanatory purposes.
    2. How visual encodings impact our abilities as humans to accurately perceive that information.  
    3. How chart junk distracts from the message - maximize your data-ink ratio.
    4. How to calculate lie factors and the importance of design integrity.
    5. How to use more advanced design features like color, shape, and size - make sure these are truly enhancing your message!
    6. The importance of story telling in communicating with data.





1. Intro to Data Visualization



    For qualitative data, if we are just looking at one column worth of data,
    We have these 4 common visuals:
    1. Histogram
    1. Normal Quantile Plot
    2. Stem and Leaf Plot
    3. Box and Whisker Plot

    For categorical data, if we are looking at just one variable(column), we have three common visuals:
    1. Bar Chart
    1. Pie Chart
    2. Pareto Chart

    Scatter Plots
    • Correlation coefficient: the strength and direction of a linear relationship
    • Commonly denoted by r (-1 < r < 1)
      • +1: positive linear correlation
      • -1: negative linear correlation
      • 0: no linear relationship
    • Strength: Weak, Moderate, Strong
    • Direction: Positive, Negative, None

    Strong                               Moderate                               Weak
    0.7≤∣r∣≤1.00.7≤∣r∣≤1.0              0.3≤∣r∣<0.70.3≤∣r∣<0.7                     0.0≤∣r∣<0.30.0≤∣r∣<0.3



    Line Plots
    • Line plots are a common plot for viewing data over time.
    • Allow us to quickly identify overall trends, seasonal occurrences, peaks, and valleys in the data.
    Ex) stock prices over time

    Recap
    1. You motivated the need for data visualization by showing that summary statistics don't tell the full story. You saw datasets where the summary statistics were the same, but the actual data were very different!
    2. You did a review of data types. In general there are quantitative and categorical variables. Quantitative variables can be either discrete or continuous, while categorical variables are either ordinal or nominal.  
    3. You looked at univariate plots. In most cases a histogram should be used for quantitative data, while a bar chart should be used for categorical data. There are some cases where you might use one of the other plots.
    4. You then looked at bivariate plots, where you were comparing two variables to one another.  Scatter plots are the most common way to visualize two quantitative variables, while a line chart is common for data that you are watching over time. If you are comparing two categorical variables, the best choice is probably a side-by-side bar chart.
    5. You learned about correlation coefficients, which provide the strength and direction of linear relationships. You learned a rule of thumb for determining whether the relationship between two quantitative variables is strongmoderate, or weak.
    6. You then looked at cases where we had more than two variables. You learned that using these plots effectively is about building the plot that helps you see the insight that answers the question you have.
    7. You gained some insight into visual encodings and data dashboards, which will be a part of the next lessons!




Tuesday, August 14, 2018

SQL Aggregation Functions

COUNT
SUM
Min and Max
Average

NULL
: no data, different from 0 or no data
: it is a property,
so in a query, after where clause, it should be:
IS NULL
IS NOT NULL

COUNT
only counts values that are not NULL
SELECT COUNT(*)
FROM accounts;

SUM
- Can only be used with numerical values
- NULL values are considered as '0'

MIN/MAX
- Min will return the lowest number
- Max will return the highest number/ the latest date/ the non-numerical value closest alphabetically to 'z'

AVG
- Ignores NLLL values
- Can only be used for numerical data

GROUP BY
- Group by can be used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.
- Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause.
- The GROUP BY always goes between WHERE and ORDE BY.
- ORDER BY works like SORT in spreadsheet software.

DISTINCT
- returns unique rows

HAVING
- 'clean' way to filter a query that has been aggregated, but this is also commonly done using a subquery.
- Any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.

DATE
 - DATE TRUNC




CASE
SELECT account_id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
                        ELSE standard_amt_usd/standard_qty END AS unit_price
FROM orders

LIMIT 10;

SQL Joins

Database Normalization:  Are the tables storing logical groupings of the data? Can I make changes in a single location, rather than in...