## 1. Introduction

The World Health Organization (WHO) compiles data on social, economic, health and political indicators, which are available as a file called WHO.csv. As the name indicates, this is a comma-separated values (CSV) file. It has 358 columns and 202 rows, one row for each country. Examples of column titles are "Country", "Continent", "Population (in thousands) total", and "Number of confirmed poliomyelitis cases".

In this article, we extract some meaningful information from this file using the R programming language. About two years ago, I attended an online course called The Analytics Edge, offered by MIT on the edX platform. The course introduced the R language using a reduced form of the WHO CSV file mentioned above. Here, we introduce R using a different reduced version of the WHO file. Before we embark on our journey with R, let us take a look at the contents of this reduced file. It has the same 202 rows as above but only 15 columns, which makes it simpler to understand. The reduced file is available for download as WHOReduced.csv, at the top of this page. Its columns are:

- **Country**: Name of the country.
- **CountryID**: Unique numerical ID for the country.
- **Continent**: Numerical ID of the continent to which this country belongs, having one of 7 values.
- **AdultLiteracyRate**: In percentage, for the country.
- **GNI**: Gross National Income of the country, per capita.
- **Population**: Population of the country, in thousands.
- **PopGrowth**: Population growth rate as a percentage.
- **UrbanPop**: Urban population of the country as a percentage.
- **BPLPop**: Population of the country below the poverty line, as a percentage.
- **MedianAge**: The median age of the population, in years.
- **Above60**: Percentage of the population of the country above 60 years of age.
- **Below15**: Percentage of the population of the country below 15 years of age.
- **FertilityRate**: Fertility rate as a percentage.
- **HospitalBeds**: Number of hospital beds per 1000 people.
- **NumberOfPhysicians**: Number of physicians in the country.

We use this reduced data set to understand some nuances of the data, and use the R programming language for this.

## 2. Introduction to R

R is a software environment for data analysis, statistical computing and graphics. It is also a programming language, which enables one to code a set of steps to achieve a statistical or machine learning outcome. R is open-source. Though there are many choices for data analysis software like SAS, Stata, SPSS, Microsoft Excel, Matlab, Minitab, pandas, we will be using R for purposes of this article.

The latest version of R can be downloaded from the Comprehensive R Archive Network (CRAN). There are also graphical user interfaces for R, such as RStudio and Rattle. However, for the purposes of this article, we will use the command line interface and run the commands through the R console. This is shown below for the version of R that I have.

In the remainder of this article, we get introduced to R by a series of questions and their corresponding answers.

## 3. Getting useful information from the WHO file, using R commands

In this section, we pose a set of questions and get answers to these using R commands. This will serve as our introduction to R.

**How do I read in the CSV data into R?**

Data from a CSV file can be loaded into R by reading it into a *data frame*. Before getting into data frames, we need to know what a *vector* is. A vector is a series of numbers or characters stored as a single object. For example, the R command `v = c(1, 2, 3, 4, 5)` creates a vector named `v` with five elements, the numbers 1, 2, 3, 4, 5. Characters and numbers should not be combined in the same vector: R does not raise an error, but it silently converts every element to a single type (here, character). Two or more vectors of the same length can be combined into a data frame, which is an important data structure in R. If we consider two vectors `v1 = c(1, 2, 3, 4, 5)` and `v2 = c(100, 200, 300, 400, 500)`, then these two can be combined into a single data frame with five rows and two columns, the first column being `v1` and the second being `v2`. In its simplest form, a data frame can be thought of as a matrix. However, a data frame is more general than a matrix, since different columns can hold different data types, as we see below.

Since we are working with a CSV file, R has a simple command to read the entire file into a single data frame. Before executing this command, change the working directory to the folder where WHOReduced.csv is located, either through the R menu or with `setwd()`.

    > who = read.csv("WHOReduced.csv")

This command loads the entire CSV file into the data frame named `who`. Type it into the R console and hit Enter to run it.
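The vector and data frame ideas described above can be tried out directly. The sketch below reuses `v1` and `v2` from the text; the names `df`, `mixed` and `toy`, and the two made-up rows in `toy`, are assumptions for illustration only. `read.csv(text = ...)` is used here just so the example runs without a file on disk.

```r
# Two numeric vectors of equal length, as in the text
v1 = c(1, 2, 3, 4, 5)
v2 = c(100, 200, 300, 400, 500)

# Combine them column-wise into a data frame with 5 rows and 2 columns
df = data.frame(v1, v2)
print(dim(df))

# Mixing numbers and characters converts every element to character
mixed = c(1, "a", 2)
print(class(mixed))

# read.csv() can also parse inline text (hypothetical rows, not WHO data)
toy = read.csv(text = "Country,Population\nAlpha,1200\nBeta,350")
str(toy)
```

Each column of `toy` keeps its own type, which is what distinguishes a data frame from a matrix.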

**How do I start understanding the structure of this data?**

R has a useful command called `str`, which shows the structure of the data loaded into a data frame.

    > str(who)

Running this command prints the structure of the data frame. The output shows that there are 202 observations of 15 variables, that is, 202 rows with 15 variables each. The 15 variables in this data frame are `Country`, `CountryID`, `Continent`, `AdultLiteracyRate`, `GNI`, `Population`, `PopGrowth`, `UrbanPop`, `BPLPop`, `MedianAge`, `Above60`, `Below15`, `FertilityRate`, `HospitalBeds` and `NumberOfPhysicians`. Some of these variables are of `int` type, containing integer values. Others are of `num` type, containing floating point values. The first variable, `Country`, is of type `Factor`, which is a categorical variable. The output shows that `Country` has 202 categories, also known as levels, each level being a unique country name. (In R 4.0.0 and later, `read.csv` reads text columns as `chr` by default; passing `stringsAsFactors = TRUE` gives the factor described here.)

A small note on the continent labeling in this file, which is shown in the following table. Strictly speaking, these are not the names of continents, but we will use them for the purposes of this article.

| Continent Label | Continent Name |
| --- | --- |
| 1 | Eastern Mediterranean |
| 2 | Europe |
| 3 | Africa |
| 4 | North America |
| 5 | South America |
| 6 | Western Pacific |
| 7 | Asia |
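If readable names are preferred over the numeric codes, one option is to turn the codes into a factor with labels. This is only a sketch: the sample codes and the names `continent`, `continent_labels` and `continent_name` are made up for illustration, while the label order follows the table above.

```r
# A few sample continent codes (hypothetical values, not read from the file)
continent = c(2, 3, 3, 7, 1, 4)

# Labels in code order 1..7, taken from the table above
continent_labels = c("Eastern Mediterranean", "Europe", "Africa",
                     "North America", "South America",
                     "Western Pacific", "Asia")

# Map each numeric code to its label
continent_name = factor(continent, levels = 1:7, labels = continent_labels)
print(as.character(continent_name))

# Count how many sample entries fall under each label
print(table(continent_name))
```

The same call could be applied to `who$Continent` once the file is loaded, assuming the column holds the codes 1 to 7 as described.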

**How do I get a summary of this data?**

R has another useful command called `summary`, which summarizes the data loaded into a data frame.

    > summary(who)

Running this command prints a summary of all 15 variables in the data frame. For each numerical variable, R reports the minimum value, the first quartile (the value below which 25 percent of the values fall), the median (the value below which 50 percent of the values fall), the mean, the third quartile (the value below which 75 percent of the values fall), and the maximum value. For example, for the variable `MedianAge`, these values are Minimum = 15.00, First quartile = 20.00, Median = 25.00, Mean = 26.74, Third quartile = 35.00, and Maximum = 43. We also see an entry called `NA's : 23` for the variable `MedianAge`. This indicates that there are 23 entries for which the median age is not listed in the data set, and hence is not available in the data frame. The summaries of the other 13 integer and numerical variables can be read in the same way. For the factor variable `Country`, the summary lists the first six entries.

The R commands `str()` and `summary()` are very helpful for getting information on the structure of the data and a summary of the data, respectively.
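To see how `summary()` treats missing values, here is a small sketch on a made-up vector (the name `age` and its values are assumptions, not taken from the WHO file). It also shows why later commands pass `na.rm = TRUE`.

```r
# A made-up numeric vector with two missing entries
age = c(15, 20, 25, 35, 43, NA, NA)

# summary() reports the quartiles, mean, and the number of NA's
print(summary(age))

# Count the missing entries directly
print(sum(is.na(age)))

# Without na.rm, a single NA makes the mean NA
print(mean(age))

# With na.rm = TRUE, the NA entries are dropped before averaging
print(mean(age, na.rm = TRUE))

# The first quartile, computed on the non-missing values
print(quantile(age, 0.25, na.rm = TRUE))
```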

**Which country has the minimum, and which the maximum, percentage of population under 15 years of age?**

To answer this question, we first need the index of the country. The R command for the minimum is:

    > which.min(who$Below15)

Upon running this command, the R console outputs 4. The country name is then found using the following command:

    > who$Country[4]

The answer is Andorra. The above two commands can be combined into a single command:

    > who$Country[which.min(who$Below15)]

This yields the same answer: Andorra is the country with the minimum percentage of population under 15 years of age.
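A minimal sketch of how `which.min()` and `which.max()` return a position that can then index a parallel vector. The country names and percentages are made up for illustration and are not WHO values.

```r
# Hypothetical countries and their under-15 percentages (one value missing)
country = c("Alpha", "Beta", "Gamma", "Delta")
below15 = c(31.5, NA, 14.2, 46.0)

# which.min() returns the position of the smallest value, skipping NA
idx = which.min(below15)
print(idx)

# Use that position to look up the matching name
print(country[idx])

# The same pattern with which.max() in a single expression
print(country[which.max(below15)])
```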

Similarly, the following command can be used to find the country which has the maximum of this number:

    > who$Country[which.max(who$Below15)]

The answer to this is Uganda.

**Which country has the minimum, and which the maximum, percentage of population over 60 years of age?**

To answer these questions, as before, we type the command:

    > who$Country[which.min(who$Above60)]

This yields United Arab Emirates as the country with the minimum percentage of population above 60 years of age.

Similarly, the following command can be used to find the country which has the maximum of this number:

    > who$Country[which.max(who$Above60)]

The answer to this is Japan.

**Is there a country whose entire population is urban?**

Looking at the summary of the data, we see that the maximum value of the variable `UrbanPop` is 100. To find a country whose entire population is urban, we type the command:

    > who$Country[which.max(who$UrbanPop)]

This yields the answer Monaco.
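Note that `which.max()` returns only the first position at which the maximum occurs. If more than one country had an `UrbanPop` of 100, a comparison against the maximum would list all of them. The sketch below uses made-up data to show the difference; whether such ties exist in the WHO file is not checked here.

```r
# Hypothetical countries; two of them share the maximum value
country = c("Alpha", "Beta", "Gamma", "Delta")
urban = c(100, 62, 100, NA)

# which.max() reports only the first maximum
first_max = country[which.max(urban)]
print(first_max)

# Comparing against the maximum lists every tied entry; which() drops NA
all_max = country[which(urban == max(urban, na.rm = TRUE))]
print(all_max)
```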

Similarly, the following command is used to find the country which has the minimum value for this number:

    > who$Country[which.min(who$UrbanPop)]

The answer to this is Burundi.

**What does a plot of GNI vs Fertility Rate look like?**

To answer this question, we plot the data using the command:

    > plot(who$GNI, who$FertilityRate)

This yields a plot as shown below.

**Which countries have a high GNI and a high Fertility Rate?**

To answer this question, we take a subset of the original data as follows:

    > HighVals = subset(who, GNI > 10000 & FertilityRate > 2.5)

This creates a subset of the data where the GNI is greater than 10000 and the Fertility Rate is greater than 2.5. To find the number of countries which fall in this category, we use the command:

    > nrow(HighVals)

This gives the output 9, indicating that there are 9 such countries. To identify the countries which fall in this category, we use the command:

    > HighVals[c("Country", "GNI", "FertilityRate")]

This lists the name, GNI and fertility rate of each of these 9 countries.

**Which countries have the highest and lowest number of doctors per person?**

To answer this question, we add a new column to the original data frame using the command:

    > who$DrsPop = who$NumberOfPhysicians / who$Population

Here, the ratio of the variable `NumberOfPhysicians` to the variable `Population` is computed and stored as a separate vector `DrsPop` within the same data frame `who`. Since `Population` is given in thousands, `DrsPop` is the number of physicians per thousand people. To answer the above question, we use the commands:

    > who$Country[which.min(who$DrsPop)]
    > who$Country[which.max(who$DrsPop)]

The answers are, respectively, Malawi (lowest number of physicians per person) and San Marino (highest number of physicians per person).

A look at the structure of the data using the `str()` command now shows 202 observations of 16 variables, the 16th being the newly added `DrsPop`.

**What does the histogram of the number of hospital beds look like?**

To answer this question, we plot the histogram using the command:

    > hist(who$HospitalBeds)

This shows the histogram as shown in the following figure. We see that this histogram is highly skewed, with a large number of countries having a low value for the number of hospital beds.

**What does a box plot of Population Growth against Continent look like?**

To answer this question, we draw the box plot using the command:

    > boxplot(who$PopGrowth ~ who$Continent, xlab = "Continent", ylab = "Population Growth")

This shows the box plot as shown in the following figure. From this box plot, we see that there are some continents where the population growth rate is indeed negative. There are some continents where the interquartile range (the vertical height of the box) is quite small, indicating that there is not much difference between the population growth rates across the continent. With R's default settings, a point lying more than 1.5 times the interquartile range beyond the first or third quartile is termed an outlier, and is shown as a circle in the plot.

**How does the `Above60` variable vary with Continent?**

To answer this question, we use the `table` command as follows:

    > table(who$Above60, who$Continent)

This shows the table as shown below. From this table, we see that there are 11 countries in Continent 2 (Europe) having 22 percent of their population above 60 years of age.

**Can we find the average urban population on a per-continent basis?**

To answer this question, we use the `tapply` command as follows:

    > tapply(who$UrbanPop, who$Continent, mean, na.rm=TRUE)

The `tapply(arg1, arg2, arg3)` command takes three arguments: it groups `arg1` by `arg2` and applies `arg3` to each group. In this case, the `tapply` command groups the variable `UrbanPop` by the variable `Continent` and applies the mean. The parameter `na.rm=TRUE` in the above command tells R to exclude the NA values from the computation.

This shows the table below.

We see that the mean urban population is highest in Continent 1 (Eastern Mediterranean), though Continent 4 (North America) is not far behind.

**Can we find the average population growth on a per-continent basis?**

To answer this question, we again use the `tapply` command as follows:

    > tapply(who$PopGrowth, who$Continent, mean, na.rm=TRUE)

This shows the table below. We see that Continent 3 (Africa) has the highest average population growth, whereas Continent 2 (Europe) has the lowest.
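The grouping behaviour of `tapply` and the effect of `na.rm = TRUE` can be checked on a tiny made-up example. The vectors `urban` and `continent` below are hypothetical and stand in for the corresponding WHO columns.

```r
# Hypothetical urban-population percentages and their continent codes
urban = c(80, 60, NA, 30, 40, 90)
continent = c(1, 1, 2, 2, 3, 3)

# Without na.rm, any group containing an NA gets a mean of NA
print(tapply(urban, continent, mean))

# With na.rm = TRUE, the NA is dropped within its group before averaging
means = tapply(urban, continent, mean, na.rm = TRUE)
print(means)

# The result is named by group, so the top group can be looked up directly
top_group = names(means)[which.max(means)]
print(top_group)
```

Here group 1 averages 80 and 60, group 2 keeps only the value 30, and group 3 averages 40 and 90.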

## 4. Closure

In this article, we got introduced to looking at data in a CSV file using simple commands in R. The example file we used was `WHOReduced.csv`, which is a reduced version of the WHO data as of 2017. I have attempted to give an introduction to R by posing a set of simple but important questions on the data. We got introduced to the commands `read.csv()`, `str()`, `summary()`, `which.min()`, `which.max()`, `plot()`, `subset()`, `nrow()`, `hist()`, `boxplot()`, `table()` and `tapply()`. I plan to continue writing articles on this in future, and cover other important analytics tools using the R language.

Meanwhile, I urge you to load your own CSV files, try out the commands listed above, and let me know your feedback on this.

## History

- Version 1.0: 8 Feb 2017.