Descriptive Statistics
Dataset: a list of things or a collection of numbers
Examples:
Weight of 20 Students:
122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
112, 123, 55, 74, 87, 100, 96, 68, 55, 61
Hair Color of 20 Students:
Brown, Red, Black, Black, Black, Black, Red, Brown, Red, Black,
Black, Red, Brown, Black, Black, Red, Brown, Brown, Brown, Black
Sample Size (n) = total count of observations in the dataset (e.g., n = 20 in both cases above)
Data Classifications:
Discrete:
The observations fall into natural categories (e.g., hair color of 20 students: Red, Brown, Black)
Continuous:
The observations are numerical values on a continuous scale (e.g., weight of 20 students)
Distribution of Discrete Datasets:
Simply classify the data/observations into categories
Eg: Left Handed Person (L)/ Right Handed Person (R)
L, R, L, L, L, R, R, R, R, L
R, L, R, L, R, L, L, L, R, L
There are only 2 categories:
Right (R) = 9
Left (L) = 11
Inference:
Sample Proportion (occurrence of right-handed people in the sample): 9/20
Sample Proportion (occurrence of left-handed people in the sample): 11/20
Left-handed people are more common in this sample
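The counting and the sample proportions above can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, and the observations are the 20 handedness values listed in the example.

```python
from collections import Counter

# Handedness observations copied from the example above (L = left, R = right).
observations = [
    "L", "R", "L", "L", "L", "R", "R", "R", "R", "L",
    "R", "L", "R", "L", "R", "L", "L", "L", "R", "L",
]

n = len(observations)            # sample size
counts = Counter(observations)   # number of observations per category
proportions = {category: count / n for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category}: count={counts[category]}, proportion={proportions[category]:.2f}")
```

Counter is used because the categories do not have to be known in advance; any new category in the data simply gets its own count.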
Distribution of Continuous Datasets:
Continuous data has no natural categories, hence create categories (intervals of values)
Stem-and-Leaf Plot:
Create the categories from the first 2 digits of each value. All values falling in one interval share the same first 2 digits (12, 13, 14 or 15). These leading digits are called the “Stem”.
“Leaf” = all digits left over after the Stem.
Example:
126 => Stem=12; Leaf=6
1298 => Stem=12; Leaf=98
138 => Stem=13; Leaf=8
Applying the 2-digit technique to the above data:
The “Stems” can be split again: within each stem, put leaves 0-4 in one category (Stem1) and leaves 5-9 in another (Stem2)
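A minimal sketch of the stem-and-leaf idea, using the 20 student weights from the top of these notes rather than the Etruscan data. It assumes non-negative integers and takes the last digit as the leaf and everything before it as the stem; the function names and the `split` flag are made up for this illustration.

```python
from collections import defaultdict

# Student weights from the example at the top of these notes.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

def stem_and_leaf(values, split=False):
    """Group values by stem (all digits but the last) and leaf (last digit).

    With split=True each stem gets two rows:
    part 0 holds leaves 0-4 and part 1 holds leaves 5-9.
    """
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        part = 1 if split and leaf >= 5 else 0
        rows[(stem, part)].append(leaf)
    return dict(rows)

def render(rows):
    """Return one text line per (stem, part) row, smallest stem first."""
    return [f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}"
            for (stem, _part), leaves in sorted(rows.items())]

print("\n".join(render(stem_and_leaf(weights))))
print()
print("\n".join(render(stem_and_leaf(weights, split=True))))
```

Because the values are sorted before grouping, the leaves in each row come out in increasing order, which is what makes the plot easy to read.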
Dot Plot:
Draw a line (scale) running from the minimum to the maximum value and plot each observation as a dot above its value.
For observations with the same value, put the second dot above the first dot.
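The dot plot described here can be sketched as plain text. The example below uses the 20 student weights from the top of these notes and assumes integer values, with one character column per integer on the scale. The function name `dot_plot` is illustrative only.

```python
from collections import Counter

# Student weights from the example at the top of these notes.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

def dot_plot(values, dot="o"):
    """Return the lines of a text dot plot for integer values.

    One column per integer between the minimum and the maximum;
    repeated values stack upward, one dot per observation.
    """
    counts = Counter(values)
    low, high = min(values), max(values)
    width = high - low + 1
    lines = []
    for level in range(max(counts.values()), 0, -1):  # top row first
        row = "".join(dot if counts[v] >= level else " "
                      for v in range(low, high + 1))
        lines.append(row.rstrip())
    lines.append("-" * width)  # the scale line
    # Label the two ends of the scale with the minimum and maximum.
    lines.append(str(low) + str(high).rjust(width - len(str(low))))
    return lines

print("\n".join(dot_plot(weights)))
```

For a comparative dot plot, the same function could be called once per dataset and the results printed one above the other, provided both use the same scale.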
For the Etruscan example data:
Comparative Dot Plots
5-Point Method (descriptive statistics for continuous data):
Minimum = 126
Maximum=158
Range = Maximum – Minimum = 158 – 126 = 32
Range is one measure of scale (also termed dispersion or noise)
Outliers – Points far from rest of data
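Median = center item in the sorted list. It is robust, meaning it is not sensitive to outliers.
If n=25, the median is item number (n+1)/2 = 26/2 = 13 in the sorted list, which is 146
First Quartile (Q1, 25%) = center point of the first half of the sorted data, up to the median (about n/4 of the way through the list)
Second Quartile (Q2, 50%) = the median (about n/2 of the way through the list)
Third Quartile (Q3, 75%) = center point of the last half of the sorted data, from the median onward (about 3n/4 of the way through the list)
Interquartile Range (IQR) = Q3 – Q1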
As per the 5-point method for the above dataset:
Min=126, Q1=142, Median=146, Q3=150, Max=158, Range=32
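The Etruscan values themselves are not listed in these notes, so the sketch below computes the same five numbers for the 20 student weights instead. It assumes the "median of each half" convention for the quartiles, leaving the median itself out of both halves when n is odd. Other quartile conventions exist and can give slightly different Q1 and Q3. The function name is illustrative.

```python
from statistics import median

# Student weights from the example at the top of these notes.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

def five_number_summary(values):
    """Min, Q1, median, Q3 and max, plus range and IQR (needs at least 2 values)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                # first half of the sorted list
    upper = data[half + 1:] if n % 2 else data[half:]  # last half, median excluded if n is odd
    q1, q3 = median(lower), median(upper)
    return {
        "min": data[0],
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "max": data[-1],
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
    }

summary = five_number_summary(weights)
print(summary)
```

For the weights this gives Min=54, Q1=58.5, Median=82, Q3=106, Max=123, so Range=69 and IQR=47.5.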
Box Plot:
Graphical representation of the values that makes the 5-point method easy to read off: a box from Q1 to Q3 with a mark at the median, and lines out to the minimum and maximum.
           |---------|---------|
Min -------| Q1    Median   Q3 |------- Max
           |---------|---------|
Outliers and Box Plot
Points that do not belong with the rest of the dataset, because they lie far outside its range, are termed outliers
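These notes do not say how far is "far". A common convention for box plots, used here as an assumption and not taken from the notes, flags any point more than 1.5 × IQR below Q1 or above Q3. The sketch applies that rule to the 20 student weights, and then to the same weights with one hypothetical extra value of 200 added purely for illustration.

```python
from statistics import median

# Student weights from the example at the top of these notes.
weights = [122, 118, 56, 54, 77, 90, 116, 88, 72, 56,
           112, 123, 55, 74, 87, 100, 96, 68, 55, 61]

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Quartiles are the medians of the lower and upper halves of the sorted
    data (median excluded when n is odd); k=1.5 is the usual convention.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low_fence or x > high_fence]

print(iqr_outliers(weights))           # no weight is flagged
print(iqr_outliers(weights + [200]))   # the hypothetical 200 is flagged
```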
Comparing datasets:
Comparing the medians from the two 5-point summaries, there is a shift of 13 between Etruscan and Italian (146 – 133 = 13).
Such a shift in center between two datasets is termed a Location Problem
Measure of Center:
Sample Mean = average (sum of the numbers / number of items)
For the set 11, 18, 6, 4, 8, 15, 22:
Median = 11
Sample Mean = 84/7 = 12
The mean is sensitive to outliers; the median is not
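A short sketch of that sensitivity, using the seven values 11, 18, 6, 4, 8, 15, 22. The second list is hypothetical: it replaces 22 with 220 purely to show what a single outlier does to each measure.

```python
from statistics import mean, median

set1 = [11, 18, 6, 4, 8, 15, 22]
# Hypothetical variant: the largest value 22 is replaced by an outlier, 220.
with_outlier = [11, 18, 6, 4, 8, 15, 220]

print(mean(set1), median(set1))                  # 12 and 11
print(mean(with_outlier), median(with_outlier))  # mean jumps above 40, median stays 11
```

One extreme value pulls the mean from 12 to about 40.3, while the median does not move at all.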
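Walsh Averages / Hodges-Lehmann Estimate:
For a simple dataset, sort the values and lay them out as both the rows and the columns of a table.
For each pair of data points (row element, column element), calculate the average of the pair. Each pair is calculated only once, so only one triangle of the table, including the diagonal, is filled in.
The Hodges-Lehmann estimate of the center is the median of these Walsh averages.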
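As a sketch, the table of pairwise averages can be built with itertools for the seven values 11, 18, 6, 4, 8, 15, 22. The function names are illustrative, and each value is also paired with itself, which corresponds to the diagonal of the table.

```python
from itertools import combinations_with_replacement
from statistics import median

set1 = [11, 18, 6, 4, 8, 15, 22]

def walsh_averages(values):
    """Average of every pair (x_i, x_j) with i <= j, each pair counted once."""
    return [(a + b) / 2
            for a, b in combinations_with_replacement(sorted(values), 2)]

def hodges_lehmann(values):
    """Median of the Walsh averages."""
    return median(walsh_averages(values))

print(len(walsh_averages(set1)))  # 28 averages for n = 7, i.e. n(n+1)/2
print(hodges_lehmann(set1))       # 11.75
```

For this set the estimate, 11.75, falls between the median (11) and the sample mean (12).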
Measure of Scale or Noise:
Range, IQR
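Variance = measure of deviation from the mean
Set1 = 11, 18, 6, 4, 8, 15, 22
Sample mean = 12
Deviations (x_i – 12) = -1, 6, -6, -8, -4, 3, 10
Squared deviations = 1, 36, 36, 64, 16, 9, 100; sum = 262
Sample variance s² = 262/6 ≈ 43.67 (the sum of squared deviations divided by n – 1 = 6)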
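The variance calculation for Set1 can be checked step by step. This sketch computes the deviations by hand and compares the result with `statistics.variance` from the Python standard library, which also divides by n – 1.

```python
from statistics import variance

set1 = [11, 18, 6, 4, 8, 15, 22]

n = len(set1)
sample_mean = sum(set1) / n                       # 84 / 7 = 12.0
deviations = [x - sample_mean for x in set1]      # one deviation per observation
squared_deviations = [d ** 2 for d in deviations]
s2 = sum(squared_deviations) / (n - 1)            # sample variance, divides by n - 1

print(deviations)
print(sum(squared_deviations))   # 262.0
print(round(s2, 2))              # 43.67
print(variance(set1))            # library result for comparison
```

There are seven observations, so there must be seven deviations. Counting them is a quick check that no value was dropped.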