The Week 4 assignment asks for univariate and bivariate graphs of the variables in the study, together with a summary of each graph.

Summary
The histogram shows a strongly right-skewed income distribution: a few countries have a very large income per person. The bar at the extreme right represents 0.5% of the 190 countries with income data, i.e. 0.005 × 190 = 0.95, or approximately 1 country, with an income above $100000. The histogram uses a bin width of $1000.
Categorical univariate plot of income category
Summary
The graph shows the percentage of countries in each income category. More than 50% of the countries have an income per person below $5000.
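The share quoted above can be checked with a frequency table. The following is a minimal sketch, assuming the `new` data set and the `incomecategory` variable created in the program below. The second step, with its WHERE statement, is my own addition: it shows how the percentages change once the 'Missing' category is left out.

```sas
/* Frequency table of income category; 'Missing' is counted as a category */
proc freq data=new;
  tables incomecategory;
run;

/* Same table restricted to countries with a recorded income */
proc freq data=new;
  where incomecategory ne 'Missing';
  tables incomecategory;
run;
```

The percentages in the first table are relative to all observations, those in the second only to countries with income data, so the two will differ.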
Program
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.gapminder;
/*
Formatting for income, armed forces and oil
*/
format oilcategory $35.;
format incomecategory $20.;
format armedforcescategory $20.;
/*
Oil categorisation
*/
if oilperperson = . then oilcategory='Missing';
else if oilperperson lt 2 then oilcategory= 'Less than 2 tonnes per year';
else if oilperperson lt 4 then oilcategory= '2 to 4 tonnes per year';
else if oilperperson lt 6 then oilcategory= '4 to 6 tonnes per year';
else if oilperperson lt 8 then oilcategory= '6 to 8 tonnes per year';
else if oilperperson ge 8 then oilcategory= '8 tonnes or more per year'; /* ge, so a value of exactly 8 is not left uncategorised */
/*
Income per person categorisation
*/
if incomeperperson = . then incomecategory='Missing';
else if incomeperperson lt 5000 then incomecategory = 'Less than $5000';
else if incomeperperson lt 10000 then incomecategory = '$5000 to $10000';
else if incomeperperson lt 15000 then incomecategory = '$10000 to $15000';
else if incomeperperson lt 20000 then incomecategory = '$15000 to $20000';
else if incomeperperson lt 30000 then incomecategory = '$20000 to $30000';
else if incomeperperson ge 30000 then incomecategory = '$30000 and higher'; /* ge, to match the label and include exactly 30000 */
/*
Armed forces categorisation
*/
if armedforcesrate= . then armedforcescategory='Missing';
else if armedforcesrate lt 1 then armedforcescategory = 'Less than 1%';
else if armedforcesrate lt 2 then armedforcescategory = '1% to 2%';
else if armedforcesrate lt 4 then armedforcescategory = '2% to 4%';
else if armedforcesrate lt 6 then armedforcescategory = '4% to 6%';
else if armedforcesrate lt 8 then armedforcescategory = '6% to 8%';
else if armedforcesrate ge 8 then armedforcescategory = '8% and higher'; /* ge, so a value of exactly 8 is not left uncategorised */
/*
Democracy score categorisation
*/
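/* Explicit length and conversion: the implicit numeric-to-character conversion
right-aligns the value in 12 characters, which a 7-character variable would truncate */
length politycategory $12;
if polityscore = . then politycategory='Missing';
else politycategory = strip(put(polityscore, best12.));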
/*
Insert meaningful labels for the variables
*/
label country = "country"
oilperperson = "Oil per person"
incomeperperson = "Income per person ($)(based on 2010 dollar exchange rate)"
incomecategory = "Income category"
polityscore = "Democracy score"
armedforcesrate = "Armed forces personnel (% of total labor force)"
armedforcescategory = "Armed forces category"
oilcategory = "Oil category"
politycategory = "Democracy Score category";
run;
/*
Univariate report of income per person
*/
PROC UNIVARIATE Data=new; VAR incomeperperson; run;
/*
Quantitative univariate plot of
income per person for all countries
*/
proc sgplot data=new;
histogram incomeperperson /showbins binstart=0 binwidth=1000;
yaxis grid values=(0 to 20 by 1) label='Percent of countries';
title 'Quantitative univariate plot of income per person for all countries';
run;
/*
Categorical variable plot of
income per person for all countries.
*/
proc sgplot data=new;
vbar incomecategory/stat=pct barwidth=0.5;
yaxis label='Percent of countries';
title 'Categorical univariate plot of income category for all countries';
run;
/*
Univariate report of armed forces rate
*/
title 'Univariate procedure report for armedforcesrate variable';
PROC UNIVARIATE Data=new; VAR armedforcesrate; run;
/*
Quantitative univariate plot of
armed forces for all countries
*/
proc sgplot data=new;
histogram armedforcesrate /showbins binstart=0 binwidth=0.2;
yaxis grid values=(0 to 15 by 1) label='Percent of countries';
title 'Quantitative univariate plot of armed forces rate';
run;
/*
Categorical variable plot of
armed forces category for all countries
*/
PROC sgplot data=new;
vbar armedforcescategory/ stat=pct barwidth=0.5;
xaxis label='Armed forces categories';
yaxis grid values=(0 to 0.50 by 0.02) label='Percent of countries';
title 'Categorical univariate plot of armedforcescategory';
run;
/*
Scatter plot of response variable (Democracy score) versus explanatory variable (incomeperperson)
This plot does not show any clear relationship between income and democracy. This is consistent with
the paper by Daron Acemoglu et al., which concludes that no causal relationship exists between income and democracy
*/
proc sgplot data=new;
scatter x=incomeperperson y=polityscore;
xaxis grid;
yaxis grid;
title 'Scatter plot of response variable (Democracy score) versus explanatory variable (incomeperperson)';
run;
/*
Scatter plot of response variable (oilperperson) versus explanatory variable (incomeperperson)
The data shows a positive, roughly linear relationship between oil consumption
and incomeperperson, which makes sense. One would expect that as the income of a country increases,
its oil consumption increases as well due to higher standards of living.
*/
proc sgplot data=new;
scatter x=incomeperperson y=oilperperson;
xaxis grid;
yaxis grid;
title 'Scatter plot of response variable (oilperperson) versus explanatory variable (incomeperperson)';
run;
Univariate procedure of incomeperperson
The UNIVARIATE Procedure
Variable: incomeperperson (Income per person ($)(based on 2010 dollar exchange rate))
| Moments | |||
|---|---|---|---|
| N | 190 | Sum Weights | 190 |
| Mean | 8740.96608 | Sum Observations | 1660783.55 |
| Std Deviation | 14262.8091 | Variance | 203427723 |
| Skewness | 3.25047792 | Kurtosis | 14.6656757 |
| Uncorrected SS | 5.29647E10 | Corrected SS | 3.84478E10 |
| Coeff Variation | 163.171999 | Std Error Mean | 1034.73292 |
| Basic Statistical Measures | | | |
|---|---|---|---|
| Location | | Variability | |
| Mean | 8740.966 | Std Deviation | 14263 |
| Median | 2553.496 | Variance | 203427723 |
| Mode | . | Range | 105044 |
| | | Interquartile Range | 8681 |
| Tests for Location: Mu0=0 | | | | |
|---|---|---|---|---|
| Test | Statistic | | p Value | |
| Student's t | t | 8.447558 | Pr > \|t\| | <.0001 |
| Sign | M | 95 | Pr >= \|M\| | <.0001 |
| Signed Rank | S | 9072.5 | Pr >= \|S\| | <.0001 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 105147.438 |
| 99% | 81647.100 |
| 95% | 33945.314 |
| 90% | 26901.858 |
| 75% Q3 | 9425.326 |
| 50% Median | 2553.496 |
| 25% Q1 | 744.239 |
| 10% | 337.318 |
| 5% | 242.678 |
| 1% | 115.306 |
| 0% Min | 103.776 |
| Extreme Observations | | | |
|---|---|---|---|
| Lowest | | Highest | |
| Value | Obs | Value | Obs |
| 103.776 | 42 | 39972.4 | 145 |
| 115.306 | 30 | 52301.6 | 112 |
| 131.796 | 59 | 62682.1 | 21 |
| 155.033 | 108 | 81647.1 | 110 |
| 161.317 | 80 | 105147.4 | 128 |
| Missing Values | | | |
|---|---|---|---|
| Missing Value | Count | Percent Of All Obs | Percent Of Missing Obs |
| . | 23 | 10.80 | 100.00 |
Summary
1. The mean (8741) is greater than the median (2553), which suggests that the data is not symmetric. The distribution is right skewed: a few high values pull the mean upwards. This makes sense, as income distributions are usually right skewed, with few countries earning high incomes.
2. The standard deviation (14263) does not describe this spread in the way it would for a normal distribution, under which around 68% of the data lies within 1 standard deviation of the mean. The lower bound, mean - sd = 8741 - 14263, is negative, so only the upper bound, mean + sd = 8741 + 14263 = 23004, matters here. In an earlier post I created a frequency distribution of the income per person for all countries. The categories 'Less than $5000', '$5000 to $10000', '$10000 to $15000' and '$15000 to $20000' together cover 54% + 13% + 4.7% + 6.5% = 78% of the countries, and all of them lie below 23004. This is greater than the 68% expected under a normal distribution. The quantile table agrees: the 75th percentile is 9425 and the 90th percentile is 26902, so between 75% and 90% of the non-missing values lie below 23004. This suggests that the data is not normally distributed and that a large portion of the countries is in the lower income group.
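The one-standard-deviation check in point 2 can also be computed directly from the data instead of from the category percentages. This is a minimal sketch, assuming the `new` data set from the program above. The data set name `income_z` and the variable `within_1sd` are hypothetical names introduced here for illustration.

```sas
/* Standardise income: z = (income - mean) / sd; missing values stay missing */
proc standard data=new mean=0 std=1 out=income_z;
  var incomeperperson;
run;

/* Flag countries whose income lies within one standard deviation of the mean */
data income_z;
  set income_z;
  if incomeperperson ne . then within_1sd = (abs(incomeperperson) le 1);
run;

/* Percent of non-missing countries inside the one-sd band */
proc freq data=income_z;
  tables within_1sd;
run;
```

For normally distributed data the row `within_1sd = 1` would hold about 68% of the countries. A clearly larger share points to a skewed distribution.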
Quantitative univariate plot of income per person
The UNIVARIATE Procedure
Variable: armedforcesrate (Armed forces personnel (% of total labor force))
| Moments | |||
|---|---|---|---|
| N | 164 | Sum Weights | 164 |
| Mean | 1.44401628 | Sum Observations | 236.81867 |
| Std Deviation | 1.70900751 | Variance | 2.92070667 |
| Skewness | 2.80701143 | Kurtosis | 9.43821248 |
| Uncorrected SS | 818.045203 | Corrected SS | 476.075188 |
| Coeff Variation | 118.350986 | Std Error Mean | 0.13345107 |
| Basic Statistical Measures | | | |
|---|---|---|---|
| Location | | Variability | |
| Mean | 1.444016 | Std Deviation | 1.70901 |
| Median | 0.930638 | Variance | 2.92071 |
| Mode | . | Range | 10.63852 |
| | | Interquartile Range | 1.13473 |
| Tests for Location: Mu0=0 | | | | |
|---|---|---|---|---|
| Test | Statistic | | p Value | |
| Student's t | t | 10.82057 | Pr > \|t\| | <.0001 |
| Sign | M | 81.5 | Pr >= \|M\| | <.0001 |
| Signed Rank | S | 6683 | Pr >= \|S\| | <.0001 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 10.638521 |
| 99% | 9.820127 |
| 95% | 5.406536 |
| 90% | 3.290807 |
| 75% Q3 | 1.613217 |
| 50% Median | 0.930638 |
| 25% Q1 | 0.478489 |
| 10% | 0.234286 |
| 5% | 0.134730 |
| 1% | 0.066100 |
| 0% Min | 0.000000 |
| Extreme Observations | | | |
|---|---|---|---|
| Lowest | | Highest | |
| Value | Obs | Value | Obs |
| 0.000000 | 82 | 5.95585 | 187 |
| 0.066100 | 86 | 6.39494 | 174 |
| 0.102269 | 132 | 7.73791 | 90 |
| 0.105115 | 150 | 9.82013 | 59 |
| 0.114593 | 116 | 10.63852 | 100 |
| Missing Values | | | |
|---|---|---|---|
| Missing Value | Count | Percent Of All Obs | Percent Of Missing Obs |
| . | 49 | 23.00 | 100.00 |
Summary
1. The median (0.93) is less than the mean (1.44), suggesting that the armed forces rate is right skewed.
2. The standard deviation is 1.7 around a mean of 1.4, so under a normal distribution about 68% of the data should lie between 1.4 - 1.7 = -0.3 and 1.4 + 1.7 = 3.1. The lower bound is below the minimum of 0, so only the upper bound of 3.1 matters. The frequency distribution of the armed forces data shows that 69% of the countries are already covered by the categories 'Less than 1%', '1% to 2%' and '2% to 4%'. The quantile table points the same way: the 75th percentile is 1.61 and the 90th percentile is 3.29, so between 75% and 90% of the non-missing values lie below 3.1, more than the expected 68%. This means the data is not normally distributed and is strongly right skewed.
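The informal comparison against the 68% rule can be complemented by the normality diagnostics built into PROC UNIVARIATE. This is a sketch only, assuming the `new` data set from the program above. The NORMAL option on the PROC statement requests tests for normality, and the HISTOGRAM and QQPLOT statements overlay a normal distribution fitted from the sample mean and standard deviation.

```sas
/* Normality tests plus visual checks for both quantitative variables */
proc univariate data=new normal;
  var incomeperperson armedforcesrate;
  /* histogram with a fitted normal curve overlaid */
  histogram incomeperperson armedforcesrate / normal;
  /* Q-Q plot against a normal distribution with estimated mean and sd */
  qqplot incomeperperson armedforcesrate / normal(mu=est sigma=est);
run;
```

Small p-values in the "Tests for Normality" table and points that bend away from the reference line in the Q-Q plot would both support the conclusion drawn above.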
Quantitative plot of armed forces rate
Summary
This plot visually confirms the right skewness of the armed forces data. The bin width is set at 0.2. Each bar should be read as the percentage of countries whose armed forces rate falls in that bin.
Categorical plot of armedforces category
Summary
This plot shows the distribution of countries across the armed forces categories. For example, it can be read as: the percentage of countries with an armed forces rate in the range 1% to 2% is 20%.
Scatter plot of democracy score versus incomeperperson
Summary
A scatter plot of the response variable (democracy score) against the explanatory variable (incomeperperson) shows no clear relationship between the two variables. Restricting the plot to incomes below $40000 does not reveal one either. The claim that a causal relationship exists between income and democracy has also been challenged in the paper by Daron Acemoglu et al.
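The visual impression can be backed by a number. This is a minimal sketch, assuming the `new` data set from the program above: PROC CORR reports Pearson and Spearman correlation coefficients for the two variables. The second step mirrors the restriction to incomes below $40000 mentioned in the summary. A correlation only measures association and says nothing about causality in either direction.

```sas
/* Pearson and Spearman correlation, countries with both values present */
proc corr data=new pearson spearman;
  var incomeperperson polityscore;
run;

/* Same, restricted to incomes below $40000.
   SAS treats a missing value as smaller than any number, so missing
   incomes pass this WHERE filter; PROC CORR then drops them itself. */
proc corr data=new pearson spearman;
  where incomeperperson lt 40000;
  var incomeperperson polityscore;
run;
```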
Scatter plot of oilperperson versus incomeperperson
Summary
1. This plot shows a positive relationship between the response variable (oilperperson) and the explanatory variable (incomeperperson). Intuitively this makes sense, as higher income usually brings a higher standard of living and with it higher oil consumption.
2. The country with the largest oil consumption, with an income in the upper bracket of the income scale, is Singapore. An interesting point to study would be why Singapore consumes so much oil compared to countries in the same income bracket. Does it have anything to do with a large armed forces rate, or is some other variable influencing the oil consumption? A nice topic for regression analysis.
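As a first step towards that regression analysis, the scatter plot can be redrawn with country labels and a fitted straight line, so that outlying countries can be read off the plot directly. This is a sketch under the assumption that the `new` data set from the program above is available. The title text is my own.

```sas
/* Scatter plot with each point labelled by country, plus a linear fit */
proc sgplot data=new;
  scatter x=incomeperperson y=oilperperson / datalabel=country;
  /* regression line only; the markers are already drawn by SCATTER */
  reg x=incomeperperson y=oilperperson / nomarkers;
  xaxis grid;
  yaxis grid;
  title 'Oil per person versus income per person, with country labels and a fitted line';
run;
title;
```

Labelling every point can clutter the plot where countries cluster at low incomes, but points far from the fitted line remain easy to identify.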