Wednesday, August 1, 2018

Descriptive Statistics


Dataset - a list of things or a collection of numbers

Examples:

                Weight of 20 Students
                122, 118, 56, 54, 77, 90, 116, 88, 72, 56
                112, 123, 55, 74, 87, 100, 96, 68, 55, 61

                Hair Color of 20 Students
                Brown, Red, Black, Black, Black, Black, Red, Brown, Red, Black
                Black, Red, Brown, Black, Black, Red, Brown, Brown, Brown, Black
               
Sample Size (n) = total count of observations in the dataset (ex: n=20 in both cases above)

Data Classifications:

 Discrete:

Data that fall into natural categories (Ex: hair color of 20 students: Red, Brown, Black)

 Continuous:

Data that are continuous numerical measurements (Ex: weight of 20 students)

Distribution of Discrete Datasets:

Simply classify the observations into their categories and count how many fall into each
               
Eg: Left Handed Person (L)/ Right Handed Person (R)
L, R, L, L, L, R, R, R, R, L
R, L, R, L, R, L, L, L, R, L
               
There are only 2 categories:
Right(R) = 9
Left (L) = 11
               

Inference:

Sample Proportion (occurrence of right-handed people in the sample): 9/20
Sample Proportion (occurrence of left-handed people in the sample): 11/20
Left-handed people are more common in this sample
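
The counting above can be reproduced in a few lines. This is a minimal sketch in Python; the variable names (hands, counts, proportions) are chosen here for illustration and are not part of the original notes.

```python
from collections import Counter

# Handedness observations copied from the example above.
hands = ("L R L L L R R R R L "
         "R L R L R L L L R L").split()

n = len(hands)                  # sample size
counts = Counter(hands)         # number of observations per category
proportions = {category: count / n for category, count in counts.items()}

print("n =", n)
for category in sorted(counts):
    print(category, counts[category], proportions[category])
```

The same pattern works for any discrete dataset, such as the hair colors listed earlier.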





Distribution of Continuous Datasets:

No natural categories, hence we create categories (intervals) ourselves


Stem-and-Leaf Plot:

Create the categories from the first 2 digits. All values falling into one interval share the same first 2 digits (12 or 13 or 14 or 15). These leading digits are called the “Stem”. “Leaf” = all digits left over after the Stem.
Example:
126 => Stem=12; Leaf=6
1298 => Stem=12; Leaf=98
138 => Stem=13; Leaf=8

Applying the 2Digit technique on above data:


Each Stem can be split again into two rows: one row for leaves 0-4 and a second row for leaves 5-9
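
As a rough illustration of the stem/leaf split, here is a sketch in Python applied to the weights of 20 students listed at the top of this post. It assumes the leaf is the last digit and the stem is everything before it, which matches 126 => 12|6 but differs from the 1298 example; the function name stem_and_leaf is made up for this sketch.

```python
from collections import defaultdict

# Weights of 20 students, copied from the first example in this post.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # assumption: the leaf is the last digit only
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(weights)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem:>2} | {leaves}")
```

Each printed row is one stem followed by its leaves in increasing order, so the shape of the distribution can be read off from the row lengths.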

Dot Plot:

Draw a line (scale) that covers the range from the minimum to the maximum and plot each value as a dot on it.
When several observations have the same value, put each additional dot above the previous one.
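
A text-only version of a dot plot can be sketched as follows. This is only an illustration in Python using the weights of 20 students from the top of this post; the helper name dot_plot_rows and the one-character-per-unit scale are assumptions made for this sketch.

```python
from collections import Counter

# Weights of 20 students, copied from the first example in this post.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

def dot_plot_rows(values):
    """Return text rows (top row first) with one column per integer on the scale."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    tallest = max(counts.values())       # height of the highest stack of dots
    rows = []
    for level in range(tallest, 0, -1):
        rows.append("".join("." if counts[v] >= level else " "
                            for v in range(lo, hi + 1)))
    return rows

rows = dot_plot_rows(weights)
width = len(rows[0])
for row in rows:
    print(row)
print("-" * width)                       # the scale line
low_label = str(min(weights))
print(low_label + str(max(weights)).rjust(width - len(low_label)))
```

Repeated values (55 and 56 in this dataset) show up as a second dot stacked above the first.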

For the example data of Etruscan:

Comparative Dot Plots





5-Point Method (Descriptive Statistics for Continuous Data):


Minimum = 126
Maximum = 158
Range = Maximum - Minimum = 32
Range is also termed a measure of scale (dispersion, noise)
Outliers = points far from the rest of the data
Median = center item in the sorted list. It is a robust statistic (not sensitive to outliers)
If n=25 then the median is item number (n+1)/2 = 26/2 = 13 in the sorted list, which is 146

(25%) First Quartile (Q1) = center point of the 1st half of the sorted data, up to the median (near position n/4)
(50%) Second Quartile (Q2) = the median (near position n/2)
(75%) Third Quartile (Q3) = center point of the last half of the sorted data, from the median onward (near position 3n/4)
Interquartile Range (IQR) = Q3 - Q1
As per the 5-Point method for the above dataset:
Min=126, Q1=142, Median=146, Q3=150, Max=158, Range=32
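
The five values can be computed with a short script. This is a sketch in Python on the weights of 20 students from the top of this post (the dataset behind the numbers above is not reproduced here). It assumes the "median of each half" rule for the quartiles; other quartile conventions exist and can give slightly different values. The function name five_number_summary is made up for this sketch.

```python
from statistics import median

# Weights of 20 students, copied from the first example in this post.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule (an assumption)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                # values below the median
    upper = data[half + 1:] if n % 2 else data[half:]  # skip the middle item when n is odd
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

summary = five_number_summary(weights)
value_range = summary["max"] - summary["min"]
iqr = summary["q3"] - summary["q1"]
print(summary, "range =", value_range, "IQR =", iqr)
```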
Box Plot:
Graphical representation of the values that makes the 5-Point method easy to read.

               +-----------+-------+
Min -----------|           |       |----------- Max
               +-----------+-------+
               Q1        Median    Q3

Outliers and Box Plot
Points that lie far outside the range of the rest of the dataset are termed outliers
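
The notes do not say how far is "far". One common convention for box plots, used here purely as an assumption, flags any point more than 1.5 x IQR below Q1 or above Q3. The sketch below applies it in Python to the weights of 20 students from the top of this post; the value 250 is made up to show what a flagged point looks like, and the helper names are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 by the median-of-halves rule (one of several conventions)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(upper)

def find_outliers(values, k=1.5):
    """Return the values beyond the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Weights of 20 students, copied from the first example in this post.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

print(find_outliers(weights))            # the original data
print(find_outliers(weights + [250]))    # 250 is a made-up value added for illustration
```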
               

Comparing datasets:

The shift between the Etruscan and Italian datasets is the difference of their medians from the 5-Point method: 146 - 133 = 13. A shift of this kind between two datasets is termed a Location Problem

Measure of Center:

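Sample Mean = average (sum of the numbers / number of items)
Example dataset: 11, 18, 6, 4, 8, 15, 22 (sorted: 4, 6, 8, 11, 15, 18, 22)
Median = 11
Sample Mean = 84/7 = 12
The mean is sensitive to outliers; the median is not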
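
To see this sensitivity, the sketch below compares the mean and median of the seven-value dataset 11, 18, 6, 4, 8, 15, 22 with a copy in which the largest value is replaced by 220. The value 220 is invented for this illustration and is not part of the notes.

```python
from statistics import mean, median

set1 = [11, 18, 6, 4, 8, 15, 22]
# Hypothetical copy: the largest value (22) is replaced by an invented outlier (220).
with_outlier = [11, 18, 6, 4, 8, 15, 220]

print("original:     mean =", mean(set1), " median =", median(set1))
print("with outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
```

The median stays at the same center item while the mean is pulled toward the outlier.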

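s = sample standard deviation
Walsh Averages / Hodges-Lehmann Estimate:
For a simple dataset, sort the values and use them as both the row and column headings of a table.
Each cell holds the average of its row value and column value (a Walsh average). Each pair of data points is calculated only once, so only one triangle of the table, including the diagonal, is filled in.
The Hodges-Lehmann estimate is the median of these Walsh averages.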
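
The table of pairwise averages can be built programmatically. The sketch below, in Python, uses the seven-value Set1 from this post and takes every pair once (including each value paired with itself), then takes the median of those averages as the Hodges-Lehmann estimate. The function name walsh_averages is chosen for this sketch.

```python
from itertools import combinations_with_replacement
from statistics import median

set1 = [11, 18, 6, 4, 8, 15, 22]

def walsh_averages(values):
    """Average of every pair (a, b) with a <= b, each pair counted once."""
    data = sorted(values)
    return [(a + b) / 2 for a, b in combinations_with_replacement(data, 2)]

averages = walsh_averages(set1)
hl_estimate = median(averages)   # Hodges-Lehmann estimate of the center

print(len(averages), "Walsh averages")
print("Hodges-Lehmann estimate =", hl_estimate)
```

For n values there are n(n+1)/2 such averages, which is the number of cells in one triangle of the table plus the diagonal.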

Measure of Scale or Noise:

Range, IQR
Variance (s²) = measure of deviation from the mean: the sum of the squared deviations divided by (n - 1)

Set1 = 11, 18, 6, 4, 8, 15, 22
Sample mean = 12
Deviations (x - 12) = -1, 6, -6, -8, -4, 3, 10
Squared deviations = 1, 36, 36, 64, 16, 9, 100 (sum = 262)
Variance: s² = 262/6 ≈ 43.67
Standard deviation: s = √43.67 ≈ 6.61
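
The steps for Set1 can be checked with a short script. This is a sketch in Python that follows the same order (mean, deviations, squared deviations, divide by n - 1, square root) and then compares the result with the standard library's statistics.variance and statistics.stdev.

```python
import statistics
from math import sqrt

set1 = [11, 18, 6, 4, 8, 15, 22]

n = len(set1)
sample_mean = sum(set1) / n
deviations = [x - sample_mean for x in set1]     # x - mean for every observation
squared = [d ** 2 for d in deviations]           # squared deviations
variance = sum(squared) / (n - 1)                # sample variance divides by n - 1
s = sqrt(variance)                               # sample standard deviation

print("deviations:", deviations)
print("sum of squares:", sum(squared))
print("variance:", round(variance, 2), " s:", round(s, 2))

# Cross-check against the standard library.
print(statistics.variance(set1), statistics.stdev(set1))
```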