Predictive data analysis is a type of data analysis in which, after a complete statistical study of the historical data, a model produces an estimate whenever it receives a new datum. The prediction can take various forms, such as a numeric estimate or a category.
- Other types of data analysis include Descriptive, Diagnostic and Prescriptive Analysis.
This data-mining project involves a sequence of six steps:
a) Business Understanding
b) Data Understanding
c) Data Preparation
d) Modelling
e) Evaluation
f) Deployment
Main Data-Mining Concepts Used
In this project, I will be using time-series dataset(s) consisting of data values for different locations worldwide.
DATA PREPROCESSING & POSTPROCESSING
- Anomaly/Outliers - identification, verification & removal
- Unit Scaling for Normalization
- Identification of the type of distribution
- Data Reduction
- Exploratory Data Analysis (EDA) - using boxplots & scatter-plots
MODELLING
- Regression
- Model overfitting
- SVM, kNN, Linear & Polynomial Regression algorithms
Table of Contents
1. Introduction
2. Business Understanding
3. Data Understanding
4. Data Preparation
5. Modelling
6. Evaluation
7. Deployment
8. Conclusion
9. Bibliography
The Problem Statement…
History of Pandemics:
Great Plague of Marseille: 1720
As per the records, the Great Plague of Marseille started when a ship called the Grand-Saint-Antoine docked in Marseille, France, carrying a cargo of goods from the eastern Mediterranean.
It continued for the next three years, killing up to 30% of the population of Marseille.
First Cholera Pandemic: 1820
In 1820, the First Cholera Pandemic occurred in Asia, mainly affecting Indonesia, Thailand and the Philippines. It officially killed about 100,000 people. Cholera is a bacterial infection spread through contaminated water.
Spanish Influenza/Flu: 1920
The Spanish Flu was the most recent and the most unrelenting of these pandemics. It infected about half a billion people and killed, by some estimates, up to 100 million. The Spanish Flu holds the record for the deadliest pandemic officially recorded in history.
Story of its origin -
In November 2019, a severe viral infection was noticed in Wuhan, a city in the Hubei province of China. On November 17th, 2019, the first case of this infection was reportedly recorded. Doctors initially took it lightly, as if it were a normal fever/cold. But when a wide range of patients reported similar symptoms, a doctor, Dr. Li Wenliang of a Chinese Academy of Sciences (CAS) lab, claimed that it was a type of severe acute respiratory syndrome, spreading very rapidly through a new coronavirus across the whole Hubei province. Reportedly, many attempts were made to keep this from leaking anywhere, but the news got out during late December 2019. In early February 2020, Dr. Li himself died suffering from the same deadly viral infection.
As per the sources, in 2003 the same lab had found the first deadly SARS coronavirus, which led to 813 casualties all over the world within two months!
Initially, it was referred to as the "Wuhan Virus". But as it started spreading rapidly from Hubei to the whole of mainland China, that name was replaced by the term "China Virus".
Finally, on Feb. 11th 2020, the World Health Organization (WHO) gave the disease an official name: COVID-19.
WHO has also declared COVID-19 a pandemic.
Since the outbreak of COVID-19, over 350,000 people have been infected throughout the world and over 15,000 people have been killed.
COVID-19: Outbreak in world -
A grand cruise ship called the Diamond Princess was all set for its two-week journey from Yokohama (Japan) to China, Vietnam and Taiwan and back to Japan. Guests boarded it on January 20th, but as the journey was about to end, a passenger who had disembarked in Hong Kong on January 25th tested positive for coronavirus on February 1st.
The cruise was stopped from sailing further, as Japanese health inspectors asked to screen all the passengers along with the crew.
Since then, the number of confirmed COVID-19 cases on board kept increasing. At the time, the ship was carrying over 3,500 people.
As per the data available, the following graphs show how the 2019-nCoV case counts have increased since Jan. 22nd:
- Confirmed Cases:
- Death Cases:
About the Disease:
COVID-19 spreads rapidly: its Reproduction Number (R-naught) is about 2.5.
Reproduction Number: the average number of people that one infected person goes on to infect. In other words, on average, every COVID-19 patient transmits the virus to about 2.5 healthy people.
For Ebola, this number is approx. 1.7.
In order to control any epidemic, the reproduction number has to be brought down as much as possible.
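To get a feel for what this number means, here is a tiny illustrative calculation (not part of the project data): starting from a single case, the expected number of new infections per generation compounds by a factor of R-naught.
# toy illustration only, using the rough R0 figures quoted above
R0.covid = 2.5
R0.ebola = 1.7
generation = 0:5
data.frame(Generation = generation,
           COVID19 = round(R0.covid ^ generation, 1),
           Ebola   = round(R0.ebola ^ generation, 1))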
That's why most of the highly affected countries are locked down. Many of them also face shortages of hospitals, beds and doctors.
Due to the lockdowns, the markets are also going down.
Symptoms:
COVID-19, also referred to early on as 2019-nCoV, is a viral disease caused by the novel coronavirus SARS-CoV-2, which tends to attack the respiratory system of the patient and can lead to severe acute respiratory syndrome.
Though studies are still going on, the following are some of the most common symptoms of COVID-19 known so far:
- Fever
- Dry Cough
- Fatigue
- Shortness of Breath
- Nasal Congestion etc….
These are the symptoms reported by most of the patients. Apart from these, a few patients were also found to develop aches and pains and get the sniffles, according to data from China.
Business Understanding:
We are all well aware by now that this is a very severe pandemic, arguably the pandemic of the 21st century. Another major problem is that it has already been around five months since 2019-nCoV first appeared.
Scientists are still trying to create a vaccine as soon as possible, but for now, no final cure is available.
What’s the significance of this analysis?
Due to this pandemic:
- Almost every country in the world has at least one confirmed case of COVID-19
- The disease is spreading very rapidly
- The economies of various countries are going down
- Casualties keep increasing, etc.
China, being the origin of COVID-19, is the most affected country in the world.
* This becomes clearer from the following word-clouds
(total confirmed cases & deaths due to coronavirus till 21/03/2020)
The analysis estimates the rate of increase/decrease in the appearance of new cases over the next one week, in the case of China.
- This estimate can help the country know how many new beds/hospitals are going to be required within the next few days.
- It is helpful for the country in terms of financial planning, as the economy is also going down.
- It directs where the focus has to be placed first.
- It would convey more precise information for deciding how money should be divided among the construction of new hospitals/isolation centres, COVID-19 test kits, other medical equipment, care and treatment, etc.
All of this would be helpful for other countries as well, in order to bring the Reproduction Number (R-naught) down to an acceptable limit (by applying the strategies that worked for China).
* We'll be preparing separate datasets for generating a few of these visualizations during the data-preparation phase.
* To understand how these visuals are developed from scratch, please refer to map.R and visualizer.R.
Data Understanding:
* Data & Problem Statement
Our aim is to analyse how the status of the epidemic will change within a week and, hence, how long it might take China to bring this infection under control.
Further, we aim to generalize this for any country, so that we can find the same status whenever required. Obviously, we'd have to change a few parameters, though we won't have to spend time again and again on data preparation for a particular country.
Hence, we need the data about:
- Confirmed cases of COVID-19, throughout the world
- Death cases due to COVID-19, throughout the world
- Total Recovery from COVID-19, throughout the world
We’ve collected this data from various sources including Johns Hopkins University, WHO & MoHFW - India.
Beginning with the coding section:
Let's start by setting up our working directory and loading all the packages that we will require:
Initial Project Setup
##### Loading LIBRARIES #####
library(stringr)
library(ggplot2)
library(AUCRF)
library(randomForest)
library(RFmarkerDetector)
library(caret)
library(mlbench)
library(kernlab)
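If any of these packages are missing, they can be installed first (a one-time setup step; note that some of them, e.g. RFmarkerDetector, may only be available from the CRAN archive):
# one-time setup: install whichever required packages are missing
pkgs = c("stringr", "ggplot2", "AUCRF", "randomForest",
         "RFmarkerDetector", "caret", "mlbench", "kernlab")
install.packages(setdiff(pkgs, rownames(installed.packages())))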
* Data Collection
Basically, we have collected the raw data from websites and a GitHub repository.
# loading raw data
check.Confirmed = read.csv("static/raw/time_series_19-covid-Confirmed.csv")
check.Deaths = read.csv("static/raw/time_series_19-covid-Deaths.csv")
check.Recovered = read.csv("static/raw/time_series_19-covid-Recovered.csv")
The data we have collected is in CSV (i.e. "comma-separated values") format.
A CSV file stores structured data row-wise, where the data elements in each row are separated by commas (,).
It’s pretty similar to the following:
Belinda Jameson,2017,Cushing House,148,3.52
Jeff Smith,2018,Prescott House,17-D,3.20
Let’s view the head portion of our raw data:
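The listings below are the first six rows of each of the three data frames, obtained roughly as follows:
# first six rows of each raw dataset (printed below)
head(check.Confirmed)
head(check.Deaths)
head(check.Recovered)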
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 1 Thailand 15.0000 101.0000 2 3
## 2 Japan 36.0000 138.0000 2 1
## 3 Singapore 1.2833 103.8333 0 1
## 4 Nepal 28.1667 84.2500 0 0
## 5 Malaysia 2.5000 112.5000 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20
## 1 5 7 8 8 14 14 14 19
## 2 2 2 4 4 7 7 11 15
## 3 3 3 4 5 7 7 10 13
## 4 0 1 1 1 1 1 1 1
## 5 0 3 4 4 4 7 8 8
## 6 0 0 0 0 1 1 1 1
## X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20
## 1 19 19 19 25 25 25 25 32 32
## 2 20 20 20 22 22 45 25 25 26
## 3 16 18 18 24 28 28 30 33 40
## 4 1 1 1 1 1 1 1 1 1
## 5 8 8 8 10 12 12 12 16 16
## 6 1 1 1 1 2 2 4 4 4
## X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20
## 1 32 33 33 33 33 33 34 35
## 2 26 26 28 28 29 43 59 66
## 3 45 47 50 58 67 72 75 77
## 4 1 1 1 1 1 1 1 1
## 5 18 18 18 19 19 22 22 22
## 6 4 4 4 4 4 4 4 5
## X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20
## 1 35 35 35 35 35 35 35 37
## 2 74 84 94 105 122 147 159 170
## 3 81 84 84 85 85 89 89 91
## 4 1 1 1 1 1 1 1 1
## 5 22 22 22 22 22 22 22 22
## 6 5 5 5 6 6 6 6 7
## X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20
## 1 40 40 41 42 42 43 43 43
## 2 189 214 228 241 256 274 293 331
## 3 93 93 93 102 106 108 110 110
## 4 1 1 1 1 1 1 1 1
## 5 22 23 23 25 29 29 36 50
## 6 7 7 7 8 8 8 9 12
## X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20
## 1 47 48 50 50 50 53 59 70
## 2 360 420 461 502 511 581 639 639
## 3 117 130 138 150 150 160 178 178
## 4 1 1 1 1 1 1 1 1
## 5 50 83 93 99 117 129 149 149
## 6 13 21 21 27 32 32 39 46
## X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 75 82 114 147 177 212 272
## 2 701 773 839 825 878 889 924
## 3 200 212 226 243 266 313 345
## 4 1 1 1 1 1 1 1
## 5 197 238 428 566 673 790 900
## 6 64 64 73 103 103 186 231
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 1 Thailand 15.0000 101.0000 0 0
## 2 Japan 36.0000 138.0000 0 0
## 3 Singapore 1.2833 103.8333 0 0
## 4 Nepal 28.1667 84.2500 0 0
## 5 Malaysia 2.5000 112.5000 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 1 1 1 1 1
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20
## 1 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20
## 1 0 0 0 0 1 1 1 1
## 2 2 4 4 5 6 6 6 6
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20
## 1 1 1 1 1 1 1 1 1
## 2 6 6 6 6 10 10 15 16
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 1 1 1 1
## X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 1 1 1 1 1 1 1
## 2 19 22 22 27 29 29 29
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 2 2 2
## 6 1 1 1 4 4 7 7
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 1 Thailand 15.0000 101.0000 0 0
## 2 Japan 36.0000 138.0000 0 0
## 3 Singapore 1.2833 103.8333 0 0
## 4 Nepal 28.1667 84.2500 0 0
## 5 Malaysia 2.5000 112.5000 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20
## 1 0 0 2 2 5 5 5 5
## 2 0 0 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20
## 1 5 5 5 5 5 5 5 10 10
## 2 1 1 1 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 2 2
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 1 1 1
## 6 0 0 0 0 0 0 0 0 0
## X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20
## 1 10 10 10 12 12 12 14 15
## 2 4 9 9 9 9 12 12 12
## 3 2 9 15 15 17 18 18 24
## 4 0 0 1 1 1 1 1 1
## 5 1 3 3 3 3 7 7 7
## 6 0 0 0 0 0 0 0 0
## X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20
## 1 15 15 15 17 17 21 21 22
## 2 13 18 18 22 22 22 22 22
## 3 29 34 34 37 37 51 51 53
## 4 1 1 1 1 1 1 1 1
## 5 13 15 15 15 15 15 18 18
## 6 0 0 0 0 0 0 0 0
## X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20
## 1 22 22 28 28 28 31 31 31
## 2 22 22 22 32 32 32 43 43
## 3 62 62 62 72 72 78 78 78
## 4 1 1 1 1 1 1 1 1
## 5 18 18 18 18 18 18 22 22
## 6 0 3 3 3 3 3 3 3
## X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20
## 1 31 31 31 31 31 33 34 34
## 2 43 46 76 76 76 101 118 118
## 3 78 78 78 78 78 78 96 96
## 4 1 1 1 1 1 1 1 1
## 5 22 22 23 24 24 24 26 26
## 6 3 3 4 4 4 4 4 4
## X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 35 35 35 35 41 42 42
## 2 118 118 118 144 144 144 150
## 3 97 105 105 109 114 114 114
## 4 1 1 1 1 1 1 1
## 5 26 35 42 42 49 60 75
## 6 4 4 4 4 4 4 4
We can see that the collected data is of the time-series type.
Time-series data of a variable is a set of observations of its values at different points in time. They are usually collected at fixed intervals, such as daily, weekly, monthly, quarterly or annually.
We observe that all three datasets have the same columns as well as the same type of data.
## Number of columns in all 3 datasets:-
matrix(
c("Confirmed", "Deaths", "Recovered", ncol(check.Confirmed), ncol(check.Deaths), ncol(check.Recovered)),
nrow = 2, ncol = 3, byrow = T
)
## [,1] [,2] [,3]
## [1,] "Confirmed" "Deaths" "Recovered"
## [2,] "62" "62" "62"
## Dimensions of datasets:-
## [1] 468 62
######################
# columns' name
#cat("Name of columns in all 3 datasets:-\n")
#colnames(check.Confirmed)
#colnames(check.Deaths)
#colnames(check.Recovered)
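The dimensions reported above can be reproduced directly with dim(), which returns rows × columns (the 62 columns are the 4 identifier columns plus 58 daily date columns):
dim(check.Confirmed) # 468 62 for the raw confirmed-cases table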
Let’s see the structure of these datasets:
## 'data.frame': 468 obs. of 62 variables:
## $ Province.State: Factor w/ 321 levels "","Adams, IN",..: 1 1 1 1 1 25 196 298 237 1 ...
## $ Country.Region: Factor w/ 155 levels "Afghanistan",..: 142 76 129 103 90 27 8 8 8 25 ...
## $ Lat : num 15 36 1.28 28.17 2.5 ...
## $ Long : num 101 138 103.8 84.2 112.5 ...
## $ X1.22.20 : int 2 2 0 0 0 0 0 0 0 0 ...
## $ X1.23.20 : int 3 1 1 0 0 0 0 0 0 0 ...
## $ X1.24.20 : int 5 2 3 0 0 0 0 0 0 0 ...
## $ X1.25.20 : int 7 2 3 1 3 0 0 0 0 0 ...
## $ X1.26.20 : int 8 4 4 1 4 0 3 1 0 0 ...
## $ X1.27.20 : int 8 4 5 1 4 0 4 1 0 1 ...
## $ X1.28.20 : int 14 7 7 1 4 1 4 1 0 1 ...
## $ X1.29.20 : int 14 7 7 1 7 1 4 1 1 1 ...
## $ X1.30.20 : int 14 11 10 1 8 1 4 2 3 1 ...
## $ X1.31.20 : int 19 15 13 1 8 1 4 3 2 1 ...
## $ X2.1.20 : int 19 20 16 1 8 1 4 4 3 1 ...
## $ X2.2.20 : int 19 20 18 1 8 1 4 4 2 1 ...
## $ X2.3.20 : int 19 20 18 1 8 1 4 4 2 1 ...
## $ X2.4.20 : int 25 22 24 1 10 1 4 4 3 1 ...
## $ X2.5.20 : int 25 22 28 1 12 2 4 4 3 1 ...
## $ X2.6.20 : int 25 45 28 1 12 2 4 4 4 1 ...
## $ X2.7.20 : int 25 25 30 1 12 4 4 4 5 1 ...
## $ X2.8.20 : int 32 25 33 1 16 4 4 4 5 1 ...
## $ X2.9.20 : int 32 26 40 1 16 4 4 4 5 1 ...
## $ X2.10.20 : int 32 26 45 1 18 4 4 4 5 1 ...
## $ X2.11.20 : int 33 26 47 1 18 4 4 4 5 1 ...
## $ X2.12.20 : int 33 28 50 1 18 4 4 4 5 1 ...
## $ X2.13.20 : int 33 28 58 1 19 4 4 4 5 1 ...
## $ X2.14.20 : int 33 29 67 1 19 4 4 4 5 1 ...
## $ X2.15.20 : int 33 43 72 1 22 4 4 4 5 1 ...
## $ X2.16.20 : int 34 59 75 1 22 4 4 4 5 1 ...
## $ X2.17.20 : int 35 66 77 1 22 5 4 4 5 1 ...
## $ X2.18.20 : int 35 74 81 1 22 5 4 4 5 1 ...
## $ X2.19.20 : int 35 84 84 1 22 5 4 4 5 1 ...
## $ X2.20.20 : int 35 94 84 1 22 5 4 4 5 1 ...
## $ X2.21.20 : int 35 105 85 1 22 6 4 4 5 1 ...
## $ X2.22.20 : int 35 122 85 1 22 6 4 4 5 1 ...
## $ X2.23.20 : int 35 147 89 1 22 6 4 4 5 1 ...
## $ X2.24.20 : int 35 159 89 1 22 6 4 4 5 1 ...
## $ X2.25.20 : int 37 170 91 1 22 7 4 4 5 1 ...
## $ X2.26.20 : int 40 189 93 1 22 7 4 4 5 1 ...
## $ X2.27.20 : int 40 214 93 1 23 7 4 4 5 1 ...
## $ X2.28.20 : int 41 228 93 1 23 7 4 4 5 1 ...
## $ X2.29.20 : int 42 241 102 1 25 8 4 7 9 1 ...
## $ X3.1.20 : int 42 256 106 1 29 8 6 7 9 1 ...
## $ X3.2.20 : int 43 274 108 1 29 8 6 9 9 1 ...
## $ X3.3.20 : int 43 293 110 1 36 9 13 9 11 1 ...
## $ X3.4.20 : int 43 331 110 1 50 12 22 10 11 1 ...
## $ X3.5.20 : int 47 360 117 1 50 13 22 10 13 1 ...
## $ X3.6.20 : int 48 420 130 1 83 21 26 10 13 1 ...
## $ X3.7.20 : int 50 461 138 1 93 21 28 11 13 1 ...
## $ X3.8.20 : int 50 502 150 1 99 27 38 11 15 2 ...
## $ X3.9.20 : int 50 511 150 1 117 32 48 15 15 2 ...
## $ X3.10.20 : int 53 581 160 1 129 32 55 18 18 2 ...
## $ X3.11.20 : int 59 639 178 1 149 39 65 21 20 3 ...
## $ X3.12.20 : int 70 639 178 1 149 46 65 21 20 3 ...
## $ X3.13.20 : int 75 701 200 1 197 64 92 36 35 5 ...
## $ X3.14.20 : int 82 773 212 1 238 64 112 49 46 7 ...
## $ X3.15.20 : int 114 839 226 1 428 73 134 57 61 7 ...
## $ X3.16.20 : int 147 825 243 1 566 103 171 71 68 7 ...
## $ X3.17.20 : int 177 878 266 1 673 103 210 94 78 33 ...
## $ X3.18.20 : int 212 889 313 1 790 186 267 121 94 35 ...
## $ X3.19.20 : int 272 924 345 1 900 231 307 121 144 37 ...
Explanation of each column in the fetched data:
- There are 3 dedicated datasets, holding data about Confirmed/Death/Recovery cases from all around the world
Province.State:
- Data-type: factor (values are specific, unique and valid names)
- Holds name of City/Province/State, where the data is coming from
- Eg.: Hubei
Country.Region:
- Data-type: factor (values are specific, unique and valid names)
- Holds the name of the country in which the reported area lies
- Eg.: China (Hubei is a Province of China)
Lat:
- Data-type: numeric (i.e. can have values in decimals, too)
- Holds the latitude position of the given place (as in col. 1)
- Eg.: Latitude position of Hubei = 30.9756
Long:
- Data-type: numeric (i.e. can have values in decimals, too)
- Holds the longitude position of the given place (as in col. 1)
- Eg.: Longitude position of Hubei = 112.2707
Col. 5 to 62:
- Data-type: integer (i.e. discrete) and always non-negative, as it counts a number of individuals
- It's time-series data, collected over successive intervals of time
- Each value corresponds to a different day in the series (starting 22/01/2020)
- The constant entity is the location, whose data is represented in each row
## A sample data of a location from "Confirmed Cases'" dataset:
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 155 Hubei China 30.9756 112.2707 444 444
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 155 549 761 1058 1423 3554 3554 4903
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 155 5806 7153 11177 13522 16678 19665 22112 24953
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 155 27100 29631 31728 33366 33366 48206 54406 56249
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 155 58182 59989 61682 62031 62442 62662 64084
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 155 64084 64287 64786 65187 65596 65914 66337 66907
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 155 67103 67217 67332 67466 67592 67666 67707 67743
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 155 67760 67773 67781 67786 67790 67794 67798
## X3.17.20 X3.18.20 X3.19.20
## 155 67799 67800 67800
Data Preparation:
Now that we have the raw data for our analysis, we can move on to the next phase, i.e. Data Preparation.
* Data preparation is considered to be the most time-consuming phase of any data-science project.
* On average, a very large share (often cited as 80-90%) of a data-science project's time is spent on data collection and data preparation.
* Data Cleaning
Whenever we collect raw data from various sources, it comes with a number of quality issues.
Most often, these are of the following types:
- NAs and NaNs
- Missing data values
- Incorrect data values
Checking for these flaws in our data:
# sample NA values
check.Confirmed[which(str_detect(check.Confirmed$Country.Region, "Cruise Ship")),]
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 166 Diamond Princess Cruise Ship 35.4437 139.638 NA NA
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 166 NA NA NA NA NA NA NA
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 166 NA NA NA NA NA NA NA 61
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 166 61 64 135 135 175 175 218 285
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 166 355 454 542 621 634 634 634
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 166 691 691 691 705 705 705 705 705
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 166 705 706 706 706 696 696 696 696
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 166 696 696 696 696 696 696 696
## X3.17.20 X3.18.20 X3.19.20
## 166 696 712 712
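Beyond this single row, a quick overall count of missing cells per dataset shows how widespread the NAs are (a minimal check on the data frames loaded above):
# total number of NA cells in each raw dataset
sapply(list(Confirmed = check.Confirmed,
            Deaths = check.Deaths,
            Recovered = check.Recovered),
       function(df) sum(is.na(df)))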
# sample of incorrect data ("French Guiana"): a cumulative count cannot decrease on a later day
check.Confirmed[which(str_detect(check.Confirmed$Province.State, "French Guiana")),]
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 431 French Guiana France 3.9339 -53.1258 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 431 0 0 0 0 0 0 0
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 431 0 0 0 0 0 0 0 0
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 431 0 0 0 0 0 0 0 0
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 431 0 0 0 0 0 0 0
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 431 0 0 0 0 0 0 0 0
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 431 0 0 0 0 0 5 5 5
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 431 5 3 5 3 5 7 11
## X3.17.20 X3.18.20 X3.19.20
## 431 11 11 11
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 1 Thailand 15.0000 101.0000 0 0
## 2 Japan 36.0000 138.0000 0 0
## 3 Singapore 1.2833 103.8333 0 0
## 4 Nepal 28.1667 84.2500 0 0
## 5 Malaysia 2.5000 112.5000 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20
## 1 0 0 2 2 5 5 5 5
## 2 0 0 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20
## 1 5 5 5 5 5 5 5 10 10
## 2 1 1 1 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 2 2
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 1 1 1
## 6 0 0 0 0 0 0 0 0 0
## X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20
## 1 10 10 10 12 12 12 14 15
## 2 4 9 9 9 9 12 12 12
## 3 2 9 15 15 17 18 18 24
## 4 0 0 1 1 1 1 1 1
## 5 1 3 3 3 3 7 7 7
## 6 0 0 0 0 0 0 0 0
## X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20
## 1 15 15 15 17 17 21 21 22
## 2 13 18 18 22 22 22 22 22
## 3 29 34 34 37 37 51 51 53
## 4 1 1 1 1 1 1 1 1
## 5 13 15 15 15 15 15 18 18
## 6 0 0 0 0 0 0 0 0
## X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20
## 1 22 22 28 28 28 31 31 31
## 2 22 22 22 32 32 32 43 43
## 3 62 62 62 72 72 78 78 78
## 4 1 1 1 1 1 1 1 1
## 5 18 18 18 18 18 18 22 22
## 6 0 3 3 3 3 3 3 3
## X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20
## 1 31 31 31 31 31 33 34 34
## 2 43 46 76 76 76 101 118 118
## 3 78 78 78 78 78 78 96 96
## 4 1 1 1 1 1 1 1 1
## 5 22 22 23 24 24 24 26 26
## 6 3 3 4 4 4 4 4 4
## X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 35 35 35 35 41 42 42
## 2 118 118 118 144 144 144 150
## 3 97 105 105 109 114 114 114
## 4 1 1 1 1 1 1 1
## 5 26 35 42 42 49 60 75
## 6 4 4 4 4 4 4 4
So there are many issues with our available datasets (such as blanks in place of state names and the data of a cruise ship appearing among the countries' data).
To get rid of these issues, data cleaning is performed.
For data cleaning, we consider either (or both) of these two methods:
- Removal:
Here we remove or delete the rows/columns where we find the problems. These rows/columns might include NAs.
- Replacement/Filling:
Here we replace the NAs, incorrect values or blanks with some acceptable value.
Mostly, values are replaced by the Mean or Mode, so that the overall statistical structure remains the same.
Sometimes, we also fill them on the basis of some specific calculation.
What will we do?
Because we have time-series datasets populated with discrete values, storing cumulative counts of people (confirmed with COVID-19, died due to COVID-19, or recovered from COVID-19), the issues:
- can NOT be resolved by the MEAN, because in our case a value can only remain CONSTANT or INCREASE on each following day; the MEAN need not be discrete; and the MEAN can also be lower than the previous day's value, etc.
- can NOT be resolved by the MODE, because this is medical data and the most frequently occurring number cannot blindly replace a missing value, etc.
Hence, we will NOT be using MEAN/MODE/MEDIAN replacement.
We'll replace with the running maximum instead.
We will replace missing values (NAs) with the maximum value observed up to the day before the current day.
In other words, the value is carried forward unchanged into the next day whenever that day's data is missing (or would otherwise decrease).
# removing NAs, replacing incorrect values
for (i in 1:nrow(check.Confirmed)) {
for (j in 5:ncol(check.Confirmed)) {
if(j==5) {
check.Confirmed[i,j] = ifelse(is.na(check.Confirmed[i, j]), 0, check.Confirmed[i,j])
} else {
if(is.na(check.Confirmed[i, j])){
check.Confirmed[i,j] = check.Confirmed[i, (j-1)]
} else if(check.Confirmed[i, (j-1)] > check.Confirmed[i, j]){
check.Confirmed[i,j] = check.Confirmed[i, (j-1)]
}
}
}
}
for (i in 1:nrow(check.Deaths)) {
for (j in 5:ncol(check.Deaths)) {
if(j==5) {
check.Deaths[i,j] = ifelse(is.na(check.Deaths[i, j]), 0, check.Deaths[i,j])
} else {
if(is.na(check.Deaths[i, j])){
check.Deaths[i,j] = check.Deaths[i, (j-1)]
} else if(check.Deaths[i, (j-1)] > check.Deaths[i, j]){
check.Deaths[i,j] = check.Deaths[i, (j-1)]
}
}
}
}
for (i in 1:nrow(check.Recovered)) {
for (j in 5:ncol(check.Recovered)) {
if(j==5) {
check.Recovered[i,j] = ifelse(is.na(check.Recovered[i, j]), 0, check.Recovered[i,j])
} else {
if(is.na(check.Recovered[i, j])){
check.Recovered[i,j] = check.Recovered[i, (j-1)]
} else if(check.Recovered[i, (j-1)] > check.Recovered[i, j]){
check.Recovered[i,j] = check.Recovered[i, (j-1)]
}
}
}
}
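Since the same carry-forward logic is applied to all three data frames, it could equally be wrapped in a small helper to avoid repeating the loop; a sketch that reproduces the behaviour above:
# carry-forward cleaning: the first date column's NAs become 0; afterwards any NA,
# or any value lower than the previous day's, is replaced by the previous day's value,
# so the cumulative series never decreases
fix.cumulative = function(df) {
  for (i in 1:nrow(df)) {
    for (j in 5:ncol(df)) {
      if (j == 5) {
        df[i, j] = ifelse(is.na(df[i, j]), 0, df[i, j])
      } else if (is.na(df[i, j]) || df[i, (j - 1)] > df[i, j]) {
        df[i, j] = df[i, (j - 1)]
      }
    }
  }
  df
}
# check.Confirmed = fix.cumulative(check.Confirmed)
# check.Deaths    = fix.cumulative(check.Deaths)
# check.Recovered = fix.cumulative(check.Recovered)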
# replace blanks and incorrect country/state names
# replacing in states
states = as.character(check.Confirmed$Province.State)
states.levels = as.character(levels(check.Confirmed$Province.State))
states[states %in% ""] = "Others"
states.levels[states.levels %in% ""] = "Others"
#######
states[states %in% "From Diamond Princess"] = "Diamond Princess"
states.levels = states.levels[!states.levels %in% "From Diamond Princess"]
# replacing in countries
countries = as.character(check.Confirmed$Country.Region)
countries.levels = as.character(levels(check.Confirmed$Country.Region))
countries[countries %in% "US"] = "United States"
countries[countries %in% "UK"] = "United Kingdom"
countries[countries %in% "Taiwan*"] = "Taiwan"
countries[countries %in% "The Bahamas"] = "Bahamas"
countries[countries %in% "Gambia, The"] = "Gambia"
countries[countries %in% "Korea, South"] = "South Korea"
countries[countries %in% c("Congo (Brazzaville)", "Congo (Kinshasa)", "Republic of the Congo")] = "Democratic Republic of the Congo"
###
countries.levels[countries.levels %in% "US"] = "United States"
countries.levels[countries.levels %in% "UK"] = "United Kingdom"
countries.levels[countries.levels %in% "Taiwan*"] = "Taiwan"
countries.levels[countries.levels %in% "The Bahamas"] = "Bahamas"
countries.levels[countries.levels %in% "Gambia, The"] = "Gambia"
countries.levels[countries.levels %in% "Korea, South"] = "South Korea"
countries.levels = countries.levels[!countries.levels %in% c("Congo (Brazzaville)", "Congo (Kinshasa)", "Republic of the Congo")]
countries.levels = c(countries.levels, "Democratic Republic of the Congo")
###############################
# rectified factors
states.factor = factor(c(states), levels = c(states.levels))
countries.factor = factor(countries, levels = countries.levels)
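As a side note, the repeated renaming above can also be written with a single named lookup vector; a compact sketch that performs the same country substitutions (the factor levels would still need to be rebuilt as above):
# map of old country names to corrected names, applied in one pass
rename.map = c("US" = "United States", "UK" = "United Kingdom",
               "Taiwan*" = "Taiwan", "The Bahamas" = "Bahamas",
               "Gambia, The" = "Gambia", "Korea, South" = "South Korea",
               "Congo (Brazzaville)" = "Democratic Republic of the Congo",
               "Congo (Kinshasa)" = "Democratic Republic of the Congo",
               "Republic of the Congo" = "Democratic Republic of the Congo")
renamed = as.character(check.Confirmed$Country.Region)
hit = renamed %in% names(rename.map)
renamed[hit] = rename.map[renamed[hit]]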
## THE INITIAL 4 COLUMNS ARE COMMON TO ALL 3 DATASETS, SO THE SAME RECTIFIED FACTORS APPLY TO EACH ##
# editing factors in datasets
check.Confirmed = cbind(
Province.State = states.factor,
Country.Region = countries.factor,
check.Confirmed[,3:ncol(check.Confirmed)]
)
check.Deaths = cbind(
Province.State = states.factor,
Country.Region = countries.factor,
check.Deaths[,3:ncol(check.Deaths)]
)
check.Recovered = cbind(
Province.State = states.factor,
Country.Region = countries.factor,
check.Recovered[,3:ncol(check.Recovered)]
)
Now our data has been cleaned.
Viewing the cleaned data.
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 166 Diamond Princess Cruise Ship 35.4437 139.638 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 166 0 0 0 0 0 0 0
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 166 0 0 0 0 0 0 0 61
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 166 61 64 135 135 175 175 218 285
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 166 355 454 542 621 634 634 634
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 166 691 691 691 705 705 705 705 705
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 166 705 706 706 706 706 706 706 706
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 166 706 706 706 706 706 706 706
## X3.17.20 X3.18.20 X3.19.20
## 166 706 712 712
# re-checking the previously incorrect "French Guiana" data after cleaning
check.Confirmed[which(str_detect(check.Confirmed$Province.State, "French Guiana")),]
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 431 French Guiana France 3.9339 -53.1258 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 431 0 0 0 0 0 0 0
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 431 0 0 0 0 0 0 0 0
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 431 0 0 0 0 0 0 0 0
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 431 0 0 0 0 0 0 0
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 431 0 0 0 0 0 0 0 0
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 431 0 0 0 0 0 5 5 5
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 431 5 5 5 5 5 7 11
## X3.17.20 X3.18.20 X3.19.20
## 431 11 11 11
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 1 Others Thailand 15.0000 101.0000 0 0
## 2 Others Japan 36.0000 138.0000 0 0
## 3 Others Singapore 1.2833 103.8333 0 0
## 4 Others Nepal 28.1667 84.2500 0 0
## 5 Others Malaysia 2.5000 112.5000 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20
## 1 0 0 2 2 5 5 5 5
## 2 0 0 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20
## 1 5 5 5 5 5 5 5 10 10
## 2 1 1 1 1 1 1 1 1 1
## 3 0 0 0 0 0 0 0 2 2
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 1 1 1
## 6 0 0 0 0 0 0 0 0 0
## X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20
## 1 10 10 10 12 12 12 14 15
## 2 4 9 9 9 9 12 12 12
## 3 2 9 15 15 17 18 18 24
## 4 0 0 1 1 1 1 1 1
## 5 1 3 3 3 3 7 7 7
## 6 0 0 0 0 0 0 0 0
## X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20
## 1 15 15 15 17 17 21 21 22
## 2 13 18 18 22 22 22 22 22
## 3 29 34 34 37 37 51 51 53
## 4 1 1 1 1 1 1 1 1
## 5 13 15 15 15 15 15 18 18
## 6 0 0 0 0 0 0 0 0
## X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20
## 1 22 22 28 28 28 31 31 31
## 2 22 22 22 32 32 32 43 43
## 3 62 62 62 72 72 78 78 78
## 4 1 1 1 1 1 1 1 1
## 5 18 18 18 18 18 18 22 22
## 6 0 3 3 3 3 3 3 3
## X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20
## 1 31 31 31 31 31 33 34 34
## 2 43 46 76 76 76 101 118 118
## 3 78 78 78 78 78 78 96 96
## 4 1 1 1 1 1 1 1 1
## 5 22 22 23 24 24 24 26 26
## 6 3 3 4 4 4 4 4 4
## X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 35 35 35 35 41 42 42
## 2 118 118 118 144 144 144 150
## 3 97 105 105 109 114 114 114
## 4 1 1 1 1 1 1 1
## 5 26 35 42 42 49 60 75
## 6 4 4 4 4 4 4 4
Identification of the Distribution
# total cases in each category on a daily basis
dailyConfirmed = apply(check.Confirmed[,5:ncol(check.Confirmed)], 2, sum)
dailyDeaths = apply(check.Deaths[,5:ncol(check.Deaths)], 2, sum)
dailyRecovered = apply(check.Recovered[,5:ncol(check.Recovered)], 2, sum)
# active cases
dailyActiveCases = dailyConfirmed - (dailyDeaths + dailyRecovered)
# day count (one per date column)
days = 1:length(dailyConfirmed)
trendDF = data.frame(
Day = days,
Confirmed = dailyConfirmed,
Deaths = dailyDeaths,
Recovered = dailyRecovered,
Active.Cases = dailyActiveCases)
head(trendDF, 5)
## Day Confirmed Deaths Recovered Active.Cases
## X1.22.20 1 555 17 28 510
## X1.23.20 2 654 18 30 606
## X1.24.20 3 941 26 36 879
## X1.25.20 4 1434 42 39 1353
## X1.26.20 5 2118 56 52 2010
Visualizing this trend
options(repr.plot.width=12, repr.plot.height=10)
ggplot(trendDF, aes(x = Day, y = Active.Cases)) +
# Active cases in LIGHTBLUE
geom_point(color = "lightblue") +
geom_line(color = "lightblue") +
# Confirmed Cases - RED
geom_point(aes(y = Confirmed), color = "red") +
geom_line(aes(y = Confirmed), color = "red") +
# Death Cases - BLACK
geom_point(aes(y = Deaths), color = "black") +
geom_line(aes(y = Deaths), color = "black") +
# Recovered Cases - GREEN
geom_point(aes(y = Recovered), color = "green") +
geom_line(aes(y = Recovered), color = "green") +
labs(x="Days Count", y="Cases Count") +
theme_classic() +
theme(
text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
,axis.text = element_text(size = 12)
,axis.title = element_text(size = 20)
)
Here we see that all three main categories of cases have kept increasing with the days!
allConfirmed = as.numeric(check.Confirmed[,ncol(check.Confirmed)])
allDeaths = as.numeric(check.Deaths[,ncol(check.Deaths)])
allRecoveries = as.numeric(check.Recovered[,ncol(check.Recovered)])
Checking for the mean of Active Cases till 21st March:
# finding the mean of each category
meanConfirmed = mean(allConfirmed)
meanDeaths = mean(allDeaths)
meanRecovered = mean(allRecoveries)
meanConfirmed
## [1] 519.7308
## [1] 21.14316
## [1] 181.3953
This indicates that:
mean of active cases = \(\mu\) = 374.9399142
Calculating \(1/\mu\) using the following relation:
\(\lambda = 1/\mu\)
## [1] 0.001924073
Here we see that:
\(\lambda\) > 0
The same is also true for the other two categories.
As per our studies so far, we find that, to a large extent, the trend follows an Exponential Distribution, since:
- the cases are increasing with the days
- \(\lambda\) > 0, etc.
This becomes clearer when we look at the main properties of an Exponential Distribution:
- \(\lambda\) > 0
- mean = \(\mu\) = 1/\(\lambda\)
- the x-axis (corresponding to Days) belongs to [ 0, \(\infty\) )
- the y value (here, the cumulative count) increases with x
So, we can treat the data as exponentially distributed.
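One quick visual sanity check of this is to plot the daily totals on a logarithmic scale: exponential growth shows up as a roughly straight line. A minimal sketch using the trendDF built above:
# if the growth is exponential, log(count) vs. day is approximately linear
ggplot(trendDF, aes(x = Day, y = Confirmed)) +
  geom_point(color = "red") +
  geom_line(color = "red") +
  scale_y_log10() +
  labs(x = "Days Count", y = "Confirmed Cases (log scale)") +
  theme_classic()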
However, we see that the trend of active cases does NOT always increase, and hence it is not distributed exponentially:
ggplot(trendDF, aes(x = Day, y = Active.Cases)) +
# Active cases in BLUE
geom_point(color = "blue") +
geom_line(color = "blue") +
theme_classic()
So we'll have to handle this in the modelling phase, which we will do later on.
* Data Reduction
Though we have cleaned the dataset, we see that the 'Diamond Princess' cruise ship is still present among the countries' data.
It is an outlier and has to be separated out.
So now we'll start the process of data reduction.
# removing diamond princess
Diamond.Princess.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Country.Region, "Cruise Ship", negate = F)), ]
check.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Country.Region, "Cruise Ship", negate = T)), ]
Diamond.Princess.Deaths = check.Deaths[ which(str_detect(check.Deaths$Country.Region, "Cruise Ship", negate = F)),]
check.Deaths = check.Deaths[ which(str_detect(check.Deaths$Country.Region, "Cruise Ship", negate = T)), ]
Diamond.Princess.Recovered = check.Recovered[ which(str_detect(check.Recovered$Country.Region, "Cruise Ship", negate = F)), ]
check.Recovered = check.Recovered[ which(str_detect(check.Recovered$Country.Region, "Cruise Ship", negate = T)), ]
## Rectifying Row sequences
row.names(check.Confirmed) <- NULL
row.names(check.Deaths) <- NULL
row.names(check.Recovered) <- NULL
Verifying if it is removed or not:
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 166 Jiangsu China 32.9711 119.455 1 5
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 166 9 18 33 47 70 99 129
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 166 168 202 236 271 308 341 373 408
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 166 439 468 492 515 543 570 593 604
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 166 617 626 629 631 631 631 631
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 166 631 631 631 631 631 631 631 631
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 166 631 631 631 631 631 631 631 631
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 166 631 631 631 631 631 631 631
## X3.17.20 X3.18.20 X3.19.20
## 166 631 631 631
#check.Deaths[166,]
#check.Recovered[166,]
# also checking dimensions
cat("\nEarlier dimensions: 468 X 62\n\n") # as we saw initially
##
## Earlier dimensions: 468 X 62
## New dimensions: 467 62
OK! So it’s gone!
Scaling
Now we have datasets that hold the counts of COVID-19 cases for different geographical locations.
Hence, we can now create a dataset to generate the map for every unique day (the one we saw earlier in this project).
- It means we want to plot all the countries/regions that are affected on a particular day
- For each country, it comes down to a yes/no decision: plot it on the world map for a specific day, or not
- The deciding factor is whether the country has any confirmed case up to that day or not
So we'd also need the latitude and longitude positions of those countries.
Finally, only 2 values are needed for any region, say 0 & 1, such that:
0: don't plot on the map, if it has no confirmed cases so far
1: plot on the map, if it has at least 1 confirmed case by that day
Therefore, we're going to use unit scaling (a 0/1 indicator) to recode all the values from the 5th column to the last.
# after unit scaling, every date value is either 0 or 1
ever.Affected = check.Confirmed
# Unit scaling
for (i in row.names(ever.Affected)) {
for (j in 5:ncol(ever.Affected)) {
if(ever.Affected[i,j] != 0)
ever.Affected[i,j] = 1
}
}
head(ever.Affected)
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 1 Others Thailand 15.0000 101.0000 1 1
## 2 Others Japan 36.0000 138.0000 1 1
## 3 Others Singapore 1.2833 103.8333 0 1
## 4 Others Nepal 28.1667 84.2500 0 0
## 5 Others Malaysia 2.5000 112.5000 0 0
## 6 British Columbia Canada 49.2827 -123.1207 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20
## 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 4 0 1 1 1 1 1 1 1
## 5 0 1 1 1 1 1 1 1
## 6 0 0 0 0 1 1 1 1
## X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20
## 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1
## X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20
## 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1
## X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20
## 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1
## X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20
## 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1
## X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20
## 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1
## X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1
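For reference, the same 0/1 recoding can be done without explicit loops; a vectorized sketch that yields the same result as the loop above:
# any non-zero cumulative count becomes 1 (location affected by that day), zero stays 0
ever.Affected.vec = check.Confirmed
ever.Affected.vec[, 5:ncol(ever.Affected.vec)] =
  (ever.Affected.vec[, 5:ncol(ever.Affected.vec)] > 0) * 1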
The next step is to find and remove the outliers:
We'll use scatter plots & box plots to spot suspected outliers, and compare the daily MEAN with and without them to verify, so that we can remove them safely.
# Let's visualize our data to verify:
options(repr.plot.width=12, repr.plot.height=10)
ggplot(check.Confirmed) +
geom_point(aes(x = Province.State, y = X1.29.20), color = "red", size = 2) +
theme(
text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
,axis.text = element_text(size = 12)
,axis.title = element_text(size = 20)
,axis.text.x = element_blank()
)
#ggplot(check.Deaths) +
# geom_point(aes(x=check.Deaths$Province.State, y=check.Deaths$X1.29.20), color="red", size=2)#
#ggplot(check.Recovered) +
# geom_point(aes(x=check.Recovered$Province.State, y=check.Recovered$X1.29.20), color="red", size=2)
# Let's find the name of this outlier:-
check.Confirmed[which(check.Confirmed$X1.29.20 > 400), c("Province.State", "Country.Region")]
## Province.State Country.Region
## 155 Hubei China
#check.Deaths[which(check.Deaths$X1.29.20 > 15), c("Province.State", "Country.Region")]
#check.Recovered[which(check.Recovered$X1.29.20 > 20), c("Province.State", "Country.Region")]
So, it’s Hubei.
We'll verify this by comparing the daily mean values including and excluding the Hubei province.
# comparing the daily mean values, including and excluding the Hubei province
With.Hubei = as.numeric(apply(check.Confirmed[,5:ncol(check.Confirmed)], 2, mean))
exceptHubei = check.Confirmed[ which(str_detect(check.Confirmed$Province.State, "Hubei", negate = T)), ]
Without.Hubei = as.numeric(apply(exceptHubei[,5:ncol(exceptHubei)], 2, mean))
# creating a dataframe for comparison
Mean.Comparison.Table = data.frame(
"Date" = as.character(colnames(check.Confirmed)[5:ncol(check.Confirmed)]),
"With Hubei" = c(With.Hubei),
"Without Hubei" = c(Without.Hubei))
tail(Mean.Comparison.Table, 10)
## Date With.Hubei Without.Hubei
## 49 X3.10.20 253.5824 108.7189
## 50 X3.11.20 269.1563 124.2983
## 51 X3.12.20 274.4625 129.5987
## 52 X3.13.20 310.5567 165.7597
## 53 X3.14.20 333.8865 189.1309
## 54 X3.15.20 358.1949 213.4828
## 55 X3.16.20 388.4154 243.7597
## 56 X3.17.20 421.7794 277.1931
## 57 X3.18.20 459.9315 315.4249
## 58 X3.19.20 519.3191 374.9399
So it's clear that Hubei is the outlier!
# let's remove Hubei from our dataset:
Hubei.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Province.State, "Hubei", negate = F)), ]
check.Confirmed = check.Confirmed[ which(str_detect(check.Confirmed$Province.State, "Hubei", negate = T)), ]
Hubei.Deaths = check.Deaths[ which(str_detect(check.Deaths$Province.State, "Hubei", negate = F)),]
check.Deaths = check.Deaths[ which(str_detect(check.Deaths$Province.State, "Hubei", negate = T)), ]
Hubei.Recovered = check.Recovered[ which(str_detect(check.Recovered$Province.State, "Hubei", negate = F)), ]
check.Recovered = check.Recovered[ which(str_detect(check.Recovered$Province.State, "Hubei", negate = T)), ]
## Rectifying Row sequences
row.names(check.Confirmed) <- NULL
row.names(check.Deaths) <- NULL
row.names(check.Recovered) <- NULL
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 155 Hubei China 30.9756 112.2707 444 444
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 155 549 761 1058 1423 3554 3554 4903
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 155 5806 7153 11177 13522 16678 19665 22112 24953
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 155 27100 29631 31728 33366 33366 48206 54406 56249
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 155 58182 59989 61682 62031 62442 62662 64084
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 155 64084 64287 64786 65187 65596 65914 66337 66907
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 155 67103 67217 67332 67466 67592 67666 67707 67743
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 155 67760 67773 67781 67786 67790 67794 67798
## X3.17.20 X3.18.20 X3.19.20
## 155 67799 67800 67800
# Let's check the dimensions once more
# also checking dimensions
cat("\nEarlier dimensions: 467 X 62\n\n") # after removing the cruise ship
##
## Earlier dimensions: 467 X 62
## New dimensions: 466 62
# Let's visualize once more
options(repr.plot.width=12, repr.plot.height=10)
ggplot(check.Confirmed) +
geom_point(aes(x = Province.State, y = X1.29.20), color = "red", size = 2) +
theme(
text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
,axis.text = element_text(size = 12)
,axis.title = element_text(size = 20)
,axis.text.x = element_blank()
)
Although it's comparatively better now, there are still some outliers…
# Let's find them, too:-
check.Confirmed[which(check.Confirmed$X1.29.20 > 100), c("Province.State", "Country.Region")]
## Province.State Country.Region
## 158 Guangdong China
## 159 Henan China
## 160 Zhejiang China
## 161 Hunan China
## 162 Anhui China
## 163 Jiangxi China
## 164 Shandong China
## 166 Chongqing China
## 167 Sichuan China
## 170 Beijing China
# checking via mean comparison
With.China = as.numeric(apply(check.Confirmed[,5:ncol(check.Confirmed)], 2, mean))
exceptChina = check.Confirmed[ which(str_detect(check.Confirmed$Country.Region, "China", negate = T)), ]
Without.China = as.numeric(apply(exceptChina[,5:ncol(exceptChina)], 2, mean))
# comparison
Mean.Comparison.Table = data.frame(
"Date" = as.character(colnames(check.Confirmed)[5:ncol(check.Confirmed)]),
"With China" = c(With.China),
"Without China" = c(Without.China))
head(Mean.Comparison.Table, 10)
## Date With.China Without.China
## 1 X1.22.20 0.2381974 0.01612903
## 2 X1.23.20 0.4506438 0.02534562
## 3 X1.24.20 0.8412017 0.04838710
## 4 X1.25.20 1.4442060 0.06451613
## 5 X1.26.20 2.2746781 0.09907834
## 6 X1.27.20 3.2274678 0.11520737
## 7 X1.28.20 4.3433476 0.15898618
## 8 X1.29.20 5.6051502 0.18202765
## 9 X1.30.20 7.1480687 0.21428571
## 10 X1.31.20 8.8454936 0.29032258
So, when looking at the whole world, the whole of mainland China seems to be the outlier, although we'll have to verify this first.
But because it's not a single row, we will perform this action later, i.e. during data transformation.
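(The cleaned data frames from this phase were written out so they can be reloaded in the next phase; a sketch of that save step, with paths indicative only and row.names dropped so the files read back cleanly:)
# possible save step for the cleaned data (exact relative paths depend on the working directory)
write.csv(check.Confirmed, "static/cleaned/time_series_19-covid-Confirmed.csv", row.names = FALSE)
write.csv(check.Deaths,    "static/cleaned/time_series_19-covid-Deaths.csv",    row.names = FALSE)
write.csv(check.Recovered, "static/cleaned/time_series_19-covid-Recovered.csv", row.names = FALSE)
write.csv(Hubei.Confirmed, "static/cleaned/Hubei/time_series_19-covid-Confirmed.csv", row.names = FALSE)
# ...and similarly for the remaining Hubei and Diamond-Princess data frames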
* Data transformation
# We've already saved the cleaned versions of all the files
# Loading the files in order to transform the dataset(s)
# loading the cleaned data from disk
Confirmed = read.csv("../static/cleaned/time_series_19-covid-Confirmed.csv")
Deaths = read.csv("../static/cleaned/time_series_19-covid-Deaths.csv")
Recovered = read.csv("../static/cleaned/time_series_19-covid-Recovered.csv")
Hubei.Confirmed = read.csv("../static/cleaned/Hubei/time_series_19-covid-Confirmed.csv")
Hubei.Deaths = read.csv("../static/cleaned/Hubei/time_series_19-covid-Deaths.csv")
Hubei.Recovered = read.csv("../static/cleaned/Hubei/time_series_19-covid-Recovered.csv")
Diamond.Princess.Confirmed = read.csv("../static/cleaned/Diamond-Princess/time_series_19-covid-Confirmed.csv")
Diamond.Princess.Deaths = read.csv("../static/cleaned/Diamond-Princess/time_series_19-covid-Deaths.csv")
Diamond.Princess.Recovered = read.csv("../static/cleaned/Diamond-Princess/time_series_19-covid-Recovered.csv")
# as noted, all of these files have the same set of columns;
# the only things that differ are the values in the date columns
# Let's see one dataset's structure (as all are similar)
str(Hubei.Recovered)
## 'data.frame': 1 obs. of 62 variables:
## $ State : Factor w/ 1 level "Hubei": 1
## $ Country.Region: Factor w/ 1 level "China": 1
## $ Lat : num 31
## $ Long : num 112
## $ X1.22.20 : int 28
## $ X1.23.20 : int 28
## $ X1.24.20 : int 31
## $ X1.25.20 : int 32
## $ X1.26.20 : int 42
## $ X1.27.20 : int 45
## $ X1.28.20 : int 80
## $ X1.29.20 : int 88
## $ X1.30.20 : int 90
## $ X1.31.20 : int 141
## $ X2.1.20 : int 168
## $ X2.2.20 : int 295
## $ X2.3.20 : int 386
## $ X2.4.20 : int 522
## $ X2.5.20 : int 633
## $ X2.6.20 : int 817
## $ X2.7.20 : int 1115
## $ X2.8.20 : int 1439
## $ X2.9.20 : int 1795
## $ X2.10.20 : int 2222
## $ X2.11.20 : int 2639
## $ X2.12.20 : int 2686
## $ X2.13.20 : int 3459
## $ X2.14.20 : int 4774
## $ X2.15.20 : int 5623
## $ X2.16.20 : int 6639
## $ X2.17.20 : int 7862
## $ X2.18.20 : int 9128
## $ X2.19.20 : int 10337
## $ X2.20.20 : int 11788
## $ X2.21.20 : int 11881
## $ X2.22.20 : int 15299
## $ X2.23.20 : int 15343
## $ X2.24.20 : int 16748
## $ X2.25.20 : int 18971
## $ X2.26.20 : int 20969
## $ X2.27.20 : int 23383
## $ X2.28.20 : int 26403
## $ X2.29.20 : int 28993
## $ X3.1.20 : int 31536
## $ X3.2.20 : int 33934
## $ X3.3.20 : int 36208
## $ X3.4.20 : int 38557
## $ X3.5.20 : int 40592
## $ X3.6.20 : int 42033
## $ X3.7.20 : int 43500
## $ X3.8.20 : int 45235
## $ X3.9.20 : int 46488
## $ X3.10.20 : int 47743
## $ X3.11.20 : int 49134
## $ X3.12.20 : int 50318
## $ X3.13.20 : int 51553
## $ X3.14.20 : int 52960
## $ X3.15.20 : int 54288
## $ X3.16.20 : int 55142
## $ X3.17.20 : int 56003
## $ X3.18.20 : int 56927
## $ X3.19.20 : int 57682
- Now, recalling the problem statement, we aim to find out the status of COVID-19 in China within the next 7 days
- In order to do so, we need to analyse the status of COVID-19 on all the previous days
What would it tell us?
With this, we'd be able to estimate the RATE at which the coronavirus has been spreading since late January.
Hence, we need to transform the data (see the roll-up sketch and helper functions below):
- so that rows hold the data country-wise instead of state-wise
- to include a new column "Date", storing the aggregate data (of the 3 datasets) in a single place
- to remove unnecessary columns, i.e. States, Latitude & Longitude
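For reference, the province-to-country roll-up itself can be done in one step with base R's aggregate(); the helper functions defined further below do essentially this, one country at a time (a sketch, using the Confirmed data frame loaded above):
# sum every date column within each country (drops state/lat/long detail)
by.country = aggregate(
  Confirmed[, 5:ncol(Confirmed)],                    # date columns only
  by = list(Country = Confirmed$Country.Region),
  FUN = sum
)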
Arranging data Country-Wise
Steps:
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 461 New Caledonia France -20.9043 165.6180 0 0
## 462 Bermuda United Kingdom 32.3078 -64.7505 0 0
## 463 Others Chad 15.4542 18.7322 0 0
## 464 Others El Salvador 13.7942 -88.8965 0 0
## 465 Others Fiji -17.7134 178.0650 0 0
## 466 Others Nicaragua 12.8654 -85.2072 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20
## 461 0 0 0 0 0 0 0
## 462 0 0 0 0 0 0 0
## 463 0 0 0 0 0 0 0
## 464 0 0 0 0 0 0 0
## 465 0 0 0 0 0 0 0
## 466 0 0 0 0 0 0 0
## X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20
## 461 0 0 0 0 0 0 0 0
## 462 0 0 0 0 0 0 0 0
## 463 0 0 0 0 0 0 0 0
## 464 0 0 0 0 0 0 0 0
## 465 0 0 0 0 0 0 0 0
## 466 0 0 0 0 0 0 0 0
## X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20
## 461 0 0 0 0 0 0 0 0
## 462 0 0 0 0 0 0 0 0
## 463 0 0 0 0 0 0 0 0
## 464 0 0 0 0 0 0 0 0
## 465 0 0 0 0 0 0 0 0
## 466 0 0 0 0 0 0 0 0
## X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20 X2.22.20
## 461 0 0 0 0 0 0 0
## 462 0 0 0 0 0 0 0
## 463 0 0 0 0 0 0 0
## 464 0 0 0 0 0 0 0
## 465 0 0 0 0 0 0 0
## 466 0 0 0 0 0 0 0
## X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20 X3.1.20
## 461 0 0 0 0 0 0 0 0
## 462 0 0 0 0 0 0 0 0
## 463 0 0 0 0 0 0 0 0
## 464 0 0 0 0 0 0 0 0
## 465 0 0 0 0 0 0 0 0
## 466 0 0 0 0 0 0 0 0
## X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 461 0 0 0 0 0 0 0 0
## 462 0 0 0 0 0 0 0 0
## 463 0 0 0 0 0 0 0 0
## 464 0 0 0 0 0 0 0 0
## 465 0 0 0 0 0 0 0 0
## 466 0 0 0 0 0 0 0 0
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20
## 461 0 0 0 0 0 0 0
## 462 0 0 0 0 0 0 0
## 463 0 0 0 0 0 0 0
## 464 0 0 0 0 0 0 0
## 465 0 0 0 0 0 0 0
## 466 0 0 0 0 0 0 0
## X3.17.20 X3.18.20 X3.19.20
## 461 0 0 2
## 462 0 0 2
## 463 0 0 1
## 464 0 0 1
## 465 0 0 1
## 466 0 0 1
# most of the state names are not identified
unknown = nrow(Recovered[which(str_detect(Recovered$Province.State, "Others")),])
cat(unknown, "/", nrow(Recovered), " States are NOT identified")
## 146 / 466 States are NOT identified
# Ultimately, any precaution/cure or action is more likely to be taken at the country level, rather than for an individual state, as this is a severe epidemic
# Working at the country level also makes it much easier to produce estimates for the world, since we do not have detailed data for every single state of every country.
As the Country column is a factor, we can easily list the countries that have reported confirmed cases:
Countries = levels(Confirmed$Country.Region)
cat("\nTotal number of affected countries: ", nlevels(Confirmed$Country.Region), "\n\n\nCountries:", head(as.matrix(Countries), 5) )
##
## Total number of affected countries: 153
##
##
## Countries: Afghanistan Albania Algeria Andorra Antigua and Barbuda
## Functions for extracting required data
# finds the total cases reported in a given country
# (by adding up the data of all its states)
country.aggregate.daily <- function(dfName, country) {
df <- get(dfName)
df = df[which(str_detect(df$Country.Region, country)),]
df = cbind(States = df[,1], Country = df[,2], df[,5:ncol(df)]) # ELIMINATING LATITUDE/LONGITUDE columns
row.names(df) <- NULL
temp = df # all states' data of a country
df = temp[1,]
df[3:ncol(temp)] = apply( temp[,3:ncol(temp)],
2,
sum
) # applying sum of all the states' values
df = df[2:ncol(df)] # removing column 'States'
row.names(df) <- NULL
return(df)
}
# generates a dataframe with the required data arranged country-wise
# (by appending every single country's data)
countries.daily <- function(dfName, cList) {
n = length(cList) # number of countries
flag = 0
for (i in cList) {
if(flag == 0) {
df = country.aggregate.daily(dfName, i)
flag = 1
} else {
temp = country.aggregate.daily(dfName, i)
df = rbind(df, temp)
}
}
row.names(df) <- NULL
return(df)
}
China.Confirmed = country.aggregate.daily("Confirmed", "China")
World.Confirmed = countries.daily("Confirmed", Countries)
China.Confirmed
## Country X1.22.20 X1.23.20 X1.24.20 X1.25.20 X1.26.20 X1.27.20 X1.28.20
## 1 China 104 199 371 645 1017 1454 1955
## X1.29.20 X1.30.20 X1.31.20 X2.1.20 X2.2.20 X2.3.20 X2.4.20 X2.5.20
## 1 2533 3238 3996 4738 5453 6194 7029 7775
## X2.6.20 X2.7.20 X2.8.20 X2.9.20 X2.10.20 X2.11.20 X2.12.20 X2.13.20
## 1 8475 9157 9714 10198 10626 11020 11393 11689
## X2.14.20 X2.15.20 X2.16.20 X2.17.20 X2.18.20 X2.19.20 X2.20.20 X2.21.20
## 1 11952 12164 12331 12445 12529 12588 12635 12888
## X2.22.20 X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20 X2.28.20 X2.29.20
## 1 12917 12938 12954 12968 12979 13004 13014 13019
## X3.1.20 X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20 X3.7.20 X3.8.20 X3.9.20
## 1 13025 13033 13044 13054 13071 13098 13104 13116 13117
## X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20 X3.15.20 X3.16.20 X3.17.20
## 1 13127 13148 13151 13159 13187 13209 13235 13259
## X3.18.20 X3.19.20
## 1 13303 13357
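The second block of output below (the first few countries of the world-wide table) presumably comes from a hidden call such as:
# assumed call: inspect the first rows of the country-wise aggregate
head(World.Confirmed)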
## Country X1.22.20 X1.23.20 X1.24.20 X1.25.20 X1.26.20
## 1 Afghanistan 0 0 0 0 0
## 2 Albania 0 0 0 0 0
## 3 Algeria 0 0 0 0 0
## 4 Andorra 0 0 0 0 0
## 5 Antigua and Barbuda 0 0 0 0 0
## 6 Argentina 0 0 0 0 0
## X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20 X2.1.20 X2.2.20 X2.3.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20 X2.10.20 X2.11.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20 X2.18.20 X2.19.20
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20 X2.26.20 X2.27.20
## 1 0 0 0 0 1 1 1 1
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 1 1
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20
## 1 1 1 1 1 1 1 1 1
## 2 0 0 0 0 0 0 0 0
## 3 1 1 1 3 5 12 12 17
## 4 0 0 0 1 1 1 1 1
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 1 1 1 2
## X3.7.20 X3.8.20 X3.9.20 X3.10.20 X3.11.20 X3.12.20 X3.13.20 X3.14.20
## 1 1 4 4 5 7 7 7 11
## 2 0 0 2 10 12 23 33 38
## 3 17 19 20 20 20 24 26 37
## 4 1 1 1 1 1 1 1 1
## 5 0 0 0 0 0 0 1 1
## 6 8 12 12 17 19 19 31 34
## X3.15.20 X3.16.20 X3.17.20 X3.18.20 X3.19.20
## 1 16 21 22 22 22
## 2 42 51 55 59 64
## 3 48 54 60 74 87
## 4 1 2 39 39 53
## 5 1 1 1 1 1
## 6 45 56 68 79 97
Moving to the next step
We need date-wise data:
This is because we aim to analyse the data on a daily basis. Hence, we have to add another column “Date”, or simply “Day” (holding day 1, 2, …).
In order to do so, we have to transform our data into cross-sectional data (China, Hubei & Diamond Princess) or pooled data (countries of the world other than China).
Let’s understand what cross-sectional and pooled data are:
- Cross-sectional data: Data of one or more variables, collected at the same point in time.
- Pooled data: A combination of time series data and cross-sectional data.
## Functions
countries.daily.bulk.summary = function(cList) { # date wise country data
# structure of resulting dataset (initially blank)
df <- data.frame(
Country = NULL,
Day = NULL, # day no.
Date = NULL,
Confirmed = NULL,
Deaths = NULL,
Recovered = NULL
)
# calculating all countries' data (date wise) through iteration
for(i in cList) {
this.one.confirmed = country.aggregate.daily("Confirmed", i)
this.one.deaths = country.aggregate.daily("Deaths", i)
this.one.recovered = country.aggregate.daily("Recovered", i)
times = ncol(this.one.confirmed)-1 # no. of days
day = 1:times
d = as.Date("21-01-2020", format(c("%d-%m-%Y")))
date = as.character((day + d), format(c("%d-%m-%Y"))) # its length is equal to the no. of days
date = factor(c(date), levels = date)
#max(Deaths.temp[1,5:ncol(Deaths.temp)])
confirmed = as.numeric(this.one.confirmed[1,2:ncol(this.one.confirmed)])
deaths = as.numeric(this.one.deaths[1,2:ncol(this.one.deaths)])
recovered = as.numeric(this.one.recovered[1,2:ncol(this.one.recovered)])
dataset <- data.frame(
Country = rep(i, times),
Day = factor(c(1:length(date)), levels = 1:length(date)),
Date = date,
Confirmed = confirmed,
Deaths = deaths,
Recovered = recovered
)
# joining this country
df = rbind(df, dataset)
}
return(df)
}
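The head() output below is presumably produced by a call along these lines (the chunk itself is not shown in the knitted output; the object name bulk is the one used in the later steps):
# assumed call: build the pooled, date-wise dataset for every affected country
bulk = countries.daily.bulk.summary(Countries)
head(bulk)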
## Country Day Date Confirmed Deaths Recovered
## 1 Afghanistan 1 22-01-2020 0 0 0
## 2 Afghanistan 2 23-01-2020 0 0 0
## 3 Afghanistan 3 24-01-2020 0 0 0
## 4 Afghanistan 4 25-01-2020 0 0 0
## 5 Afghanistan 5 26-01-2020 0 0 0
## 6 Afghanistan 6 27-01-2020 0 0 0
For better analysis, let’s add 2 more columns:
1. Closed.Cases = all cases that are closed, i.e. Deaths or Recoveries
2. Active.Cases = cases that are neither Deaths nor Recoveries
bulk$Active.Cases = bulk$Confirmed - (bulk$Deaths + bulk$Recovered)
bulk$Closed.Cases = bulk$Deaths + bulk$Recovered
tail(bulk)
## Country Day Date Confirmed Deaths Recovered Active.Cases
## 8869 Zambia 53 14-03-2020 0 0 0 0
## 8870 Zambia 54 15-03-2020 0 0 0 0
## 8871 Zambia 55 16-03-2020 0 0 0 0
## 8872 Zambia 56 17-03-2020 0 0 0 0
## 8873 Zambia 57 18-03-2020 2 0 0 2
## 8874 Zambia 58 19-03-2020 2 0 0 2
## Closed.Cases
## 8869 0
## 8870 0
## 8871 0
## 8872 0
## 8873 0
## 8874 0
So, our Pooled dataset is ready.
Let’s understand this dataset:-
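The structure shown below presumably comes from a hidden call such as:
# assumed call: inspect the structure of the pooled dataset
str(bulk)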
## 'data.frame': 8874 obs. of 8 variables:
## $ Country : Factor w/ 153 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : Factor w/ 58 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Factor w/ 58 levels "22-01-2020","23-01-2020",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Confirmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Active.Cases: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Closed.Cases: num 0 0 0 0 0 0 0 0 0 0 ...
Explanation of Pooled Datasets (bulk & four)
* Pooled data is a combination of time series data and cross-sectional data
Number of columns: 8
Here we are discussing the Bulk dataset
Country:
- Datatype: Factor with 153-levels
- Holds the name of Countries for daily data
- Eg.: Japan
Day:
- Datatype: Factor with 58-levels
- Holds days numbered from 1 up to the last day
- Eg.: for Jan 22nd, Day is 1, Jan 23rd, Day is 2 and so on..
Date:
- Datatype: Factor with 58-levels
- Holds dates in the dd-mm-yyyy format, with one factor level per date
- Eg.: 22-01-2020
Confirmed:
- Datatype: num
- Holds the total number of confirmed cases in a country, up to the given date/day
- Eg.: up to 01-02-2020, Japan reported 20 COVID-19 cases
Deaths:
- Datatype: num
- Holds total number of deaths in a country, up to the given date/day
- Eg.: up to 01-02-2020, Japan reported no Deaths
Recovered:
- Datatype: num
- Holds the total number of recoveries in a country, up to the given date/day
- Eg.: up to 01-02-2020, Japan reported 1 Recovery
Active.Cases:
- Datatype: num
- Holds the total number of Confirmed cases that are still open (i.e. excluding Deaths & Recoveries) in a country, up to the given date/day
- Eg.: up to 01-02-2020, Japan had 19 Active cases
Closed.Cases:
- Datatype: num
- Holds total number of Recoveries or Deaths in a country, up to the given date/day
- Eg.: up to 01-02-2020, Japan had closed 1 COVID-19 case
Now we are all set to filter out China from this dataset!
# filtering out the China
China.dataset = bulk[which(str_detect(bulk$Country, 'China')),]
# World Pooled dataset (except china)
bulk = bulk[which(str_detect(bulk$Country, 'China', negate=T)),] # updating bulk itself
## Country Day Date Confirmed Deaths Recovered Active.Cases
## 1741 China 1 22-01-2020 104 0 0 104
## 1742 China 2 23-01-2020 199 1 2 196
## 1743 China 3 24-01-2020 371 2 5 364
## 1744 China 4 25-01-2020 645 2 7 636
## 1745 China 5 26-01-2020 1017 4 7 1006
## 1746 China 6 27-01-2020 1454 6 13 1435
## Closed.Cases
## 1741 0
## 1742 3
## 1743 7
## 1744 9
## 1745 11
## 1746 19
In the same manner, we create two datasets:
- one holds the data of all the countries, except Hubei within China
- one holds the whole data categorized into four locations.
These four locations are:
- Diamond Princess
- Hubei province alone (kept separate, like the Diamond Princess cruise ship)
- China alone (except Hubei province)
- World (except China), collectively
The 2nd dataset is necessary because it also contains all the outliers…
* Here we can actually take the outliers into consideration because:
- we are comparing them against the whole world’s data, collectively
- this is medical data, where outliers cannot simply be ignored; in fact, this single country and that ship are where the disease spread most rapidly
- this 2nd dataset alone keeps track of the whole data reported up to the last date
- we have already saved this dataset
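The construction of these two datasets is not shown here; presumably the four-location dataset is assembled in the same way as bulk (one row per day per location) and both are written to disk with the paths used in the load step below. A minimal, hedged sketch of the saving step only:
# assumed saving step -- paths match the read.csv() calls below
write.csv(bulk, '../static/pooled/countryWise_bulk_summary.csv', row.names = FALSE)
write.csv(four, '../static/pooled/Four_dataset_locationWise.csv', row.names = FALSE)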
## Load both date wise-datasets (world & FOUR)
# includes data of all the countries
all = read.csv('../static/pooled/countryWise_bulk_summary.csv')
# includes data of the four major locations
four = read.csv('../static/pooled/Four_dataset_locationWise.csv')
## 'data.frame': 8874 obs. of 8 variables:
## $ Country : Factor w/ 153 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Factor w/ 58 levels "01-02-2020","01-03-2020",..: 41 43 45 47 49 51 53 55 57 58 ...
## $ Confirmed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Active.Cases: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Closed.Cases: int 0 0 0 0 0 0 0 0 0 0 ...
## 'data.frame': 232 obs. of 8 variables:
## $ Location : Factor w/ 4 levels "China","Diamond Princess",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Factor w/ 58 levels "01-02-2020","01-03-2020",..: 41 43 45 47 49 51 53 55 57 58 ...
## $ Confirmed : int 444 444 549 761 1058 1423 3554 3554 4903 5806 ...
## $ Deaths : int 17 17 24 40 52 76 125 125 162 204 ...
## $ Recovered : int 28 28 31 32 42 45 80 88 90 141 ...
## $ Active.Cases: int 399 399 494 689 964 1302 3349 3341 4651 5461 ...
## $ Closed.Cases: int 45 45 55 72 94 121 205 213 252 345 ...
In the all dataset, everything is the same as in the ‘bulk’ dataset.
In the four dataset:
Location:
- Datatype: Factor with 4-levels
- Holds the name of Locations (as Countries in ‘Bulk’) for daily data
- Levels: World, China, Hubei & Diamond Princess
The remaining 7 columns are the same as those of the ‘bulk’ dataset.
Let’s analyse how China differs from the rest of the data, using boxplots.
Why Boxplot:-
- It’s a single visualization that summarizes several statistical measures at once
- It makes outliers very easy to detect
# Initially we plot the dataset with the major Locations
options(repr.plot.width=12, repr.plot.height=10)
withChina<-ggplot(four, aes(x=Day, y=Confirmed, color=Day)) +
geom_boxplot(aes(group=Day)) +
labs(title="Including China") +
theme_classic() +
theme(
text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
,axis.text = element_text(size = 12)
,axis.title = element_text(size = 20)
)
cat("\n\n")
Here we get a continuous sequence of outliers for roughly the first 45 days.
As per our previous analysis (through word-clouds and mean comparison), we assumed China to be this outlier.
In order to test this hypothesis, let’s plot China alone, as well as all of the remaining data except China.
options(repr.plot.width=12, repr.plot.height=10)
chinaAlone <- ggplot(four[which(str_detect(four$Location, "China", negate=F)),], aes(x=Day, y=Confirmed, color=Day)) +
geom_point(aes(group=Day)) +
labs(title="Only China") +
theme_classic() +
theme(
text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
,axis.text = element_text(size = 12)
,axis.title = element_text(size = 20)
)
withoutChina <- ggplot(four[which(str_detect(four$Location, "China", negate=T)),], aes(x=Day, y=Confirmed, color=Day)) +
geom_boxplot(aes(group=Day)) +
labs(title="Excluding China") +
theme_classic() +
theme(
text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
,axis.text = element_text(size = 12)
,axis.title = element_text(size = 20)
)
cat("\n\n")
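The three ggplot objects above are presumably rendered in a hidden chunk; one way to display them together (assuming the gridExtra package is available) is:
# assumed display step: stack the three panels vertically
library(gridExtra)
grid.arrange(withChina, chinaAlone, withoutChina, ncol = 1)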
Comparing the two plots above with our previous single plot of the whole (four-location) data:
1. The China-only plot closely resembles the outlier sequence seen earlier.
2. Moreover, when we plot the whole data again after removing China, we find no outliers at all.
So, finally, we can say that China is an outlier, and hence we’ll study China separately!
## Country Day Date Confirmed Deaths Recovered Active.Cases
## 1 Afghanistan 1 22-01-2020 0 0 0 0
## 2 Afghanistan 2 23-01-2020 0 0 0 0
## 3 Afghanistan 3 24-01-2020 0 0 0 0
## 4 Afghanistan 4 25-01-2020 0 0 0 0
## 5 Afghanistan 5 26-01-2020 0 0 0 0
## 6 Afghanistan 6 27-01-2020 0 0 0 0
## Closed.Cases
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## Location Day Date Confirmed Deaths Recovered Active.Cases
## 1 Hubei 1 22-01-2020 444 17 28 399
## 2 Hubei 2 23-01-2020 444 17 28 399
## 3 Hubei 3 24-01-2020 549 24 31 494
## 4 Hubei 4 25-01-2020 761 40 32 689
## 5 Hubei 5 26-01-2020 1058 52 42 964
## 6 Hubei 6 27-01-2020 1423 76 45 1302
## Closed.Cases
## 1 45
## 2 45
## 3 55
## 4 72
## 5 94
## 6 121
We aim to analyse the status of COVID-19 within the next 10 days, which basically means analysing the active and closed cases within that time period.
But these two are just absolute counts and can vary depending on the Confirmed, Recovered & Death cases;
in other words, all these figures can vary dynamically.
So, in this situation, finding any direct relationship between the Confirmed/Recovered/Death columns and the Active/Closed cases is not an easy task.
In order to establish such a relationship, we can instead work with the proportion of Active/Closed cases,
i.e. what percent (%) of Confirmed cases is Active/Closed, which depends only on the total Confirmed cases.
Hence, before we move towards creating a suitable model from the available dataset, we have to do one last transformation, by adding two more columns to our existing dataset, i.e.
- Active Cases(%)
- Closed Cases(%)
# calculate the percent (using Confirmed cases as total)
percent <- function(dfName){
get(dfName) -> df
part <- NULL
for(i in 1:nrow(df)) {
val = df[i,"Active.Cases"]
Total = df[i,"Confirmed"]
if(i == 1)
if(val==0)
part = 0
else
part = as.numeric((val*100)/Total)
else
if(val==0)
part = c(part, 0)
else
part <- c(part, as.numeric((val*100)/Total))
}
return(part)
}
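As a side note, the same percentages could be computed without an explicit loop; a vectorised sketch (illustrative only, not the original chunk):
# vectorised equivalent of percent(): 0 when there are no active cases,
# otherwise active cases as a share of confirmed cases
percent.vec <- function(df) {
  ifelse(df$Active.Cases == 0, 0, df$Active.Cases * 100 / df$Confirmed)
}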
# CASES -> percentage
four$'percent_active' = percent("four") # Active cases, out of every 100 Confirmed cases
four$'percent_closed' = 100-percent("four") # Closed cases, out of every 100 Confirmed cases
all$'percent_active' = percent("all") # Active cases, out of every 100 Confirmed cases
all$'percent_closed' = 100-percent("all") # Closed cases, out of every 100 Confirmed cases
## 'data.frame': 8874 obs. of 10 variables:
## $ Country : Factor w/ 153 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Factor w/ 58 levels "01-02-2020","01-03-2020",..: 41 43 45 47 49 51 53 55 57 58 ...
## $ Confirmed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Active.Cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Closed.Cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ percent_active: num 0 0 0 0 0 0 0 0 0 0 ...
## $ percent_closed: num 100 100 100 100 100 100 100 100 100 100 ...
## 'data.frame': 232 obs. of 10 variables:
## $ Location : Factor w/ 4 levels "China","Diamond Princess",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Factor w/ 58 levels "01-02-2020","01-03-2020",..: 41 43 45 47 49 51 53 55 57 58 ...
## $ Confirmed : int 444 444 549 761 1058 1423 3554 3554 4903 5806 ...
## $ Deaths : int 17 17 24 40 52 76 125 125 162 204 ...
## $ Recovered : int 28 28 31 32 42 45 80 88 90 141 ...
## $ Active.Cases : int 399 399 494 689 964 1302 3349 3341 4651 5461 ...
## $ Closed.Cases : int 45 45 55 72 94 121 205 213 252 345 ...
## $ percent_active: num 89.9 89.9 90 90.5 91.1 ...
## $ percent_closed: num 10.14 10.14 10.02 9.46 8.88 ...
OK! Everything is set. Now we can go for the model creation…
Modelling:
- We now have the data of all the different countries, as well as the aggregate date-wise data of the whole world.
- We also have the data of some special locations lying far away from the trend (i.e. outliers).
So, we have to decide for which country/location we are going to do our analysis.
It can be chosen with the help of the following function:
# extracting the desired dataset
extractDatases <- function(region){
if(region %in% c("Hubei", "World", "Diamond Princess")) {
temp = four[which(str_detect(four$Location, region)),]
row.names(temp) <- NULL
} else {
temp = all[which(str_detect(all$Country, region)),]
row.names(temp) <- NULL
}
return(temp)
}
Extracting China
# Since the outliers were kept separately, the country-level dataset is missing something:
# it is better to join the Hubei data back into China, so that our countries' dataset has no gap
# joining Hubei for the complete data of China
region1 = extractDatases("China") # China alone (excluding Hubei) -- this assignment is implied by the cbind below
region2 = extractDatases("Hubei")
region = cbind(region1[,1:3], region1[,4:8]+region2[,4:8])
# filtering out desired country/location
region$'percent_active' = percent("region") # Active cases, out of every 100 Confirmed cases
region$'percent_closed' = 100-percent("region") # Closed cases, out of every 100 Confirmed cases
head(region)
## Country Day Date Confirmed Deaths Recovered Active.Cases
## 1 China 1 22-01-2020 548 17 28 503
## 2 China 2 23-01-2020 643 18 30 595
## 3 China 3 24-01-2020 920 26 36 858
## 4 China 4 25-01-2020 1406 42 39 1325
## 5 China 5 26-01-2020 2075 56 49 1970
## 6 China 6 27-01-2020 2877 82 58 2737
## Closed.Cases percent_active percent_closed
## 1 45 91.78832 8.211679
## 2 48 92.53499 7.465008
## 3 62 93.26087 6.739130
## 4 81 94.23898 5.761024
## 5 105 94.93976 5.060241
## 6 140 95.13382 4.866180
Because this is the dataset of China only, the Country column is not necessary.
Similarly, there is no need for Date when we have Day.
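The table below suggests that the Country and Date columns were dropped in a hidden step, presumably along these lines:
# assumed step: drop Country and Date, keeping Day as the time index
region <- region[, !(names(region) %in% c("Country", "Date"))]
head(region, 10)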
## Day Confirmed Deaths Recovered Active.Cases Closed.Cases percent_active
## 1 1 548 17 28 503 45 91.78832
## 2 2 643 18 30 595 48 92.53499
## 3 3 920 26 36 858 62 93.26087
## 4 4 1406 42 39 1325 81 94.23898
## 5 5 2075 56 49 1970 105 94.93976
## 6 6 2877 82 58 2737 140 95.13382
## 7 7 5509 131 101 5277 232 95.78871
## 8 8 6087 133 120 5834 253 95.84360
## 9 9 8141 171 135 7835 306 96.24125
## 10 10 9802 213 214 9375 427 95.64375
## percent_closed
## 1 8.211679
## 2 7.465008
## 3 6.739130
## 4 5.761024
## 5 5.060241
## 6 4.866180
## 7 4.211291
## 8 4.156399
## 9 3.758752
## 10 4.356254
Choosing the Right Algorithm
# setting the theme
theme_set(theme_classic())
# setting plot size
options(repr.plot.width=8, repr.plot.height=10)
We are working on a Predictive Data Analysis project (as discussed earlier).
So, in this situation, we can choose between two types of predictive algorithms, based on our objective.
These 2 categories are:-
- Classification
- Regression
* Why Regression and not Classification?
- Classification:
- It is used to categorize the given data-points into a discrete (i.e. fixed) number of categories.
- Every new data point (for which the estimate is being made) must belong to one of the existing classes/categories.
- Regression:
- It predicts the new possible outcome, based on the earlier trend.
- There is NO requirement that the predicted value must be one of a given set of categories.
We want to estimate the upcoming figure for the Active Cases’ % in China in particular, which is not limited to a fixed set of values.
We will first compare 3 regression algorithms to predict the required value from all the remaining columns.
Then, based on the accuracy of the prediction using different columns, we’ll choose the column to be used for prediction.
## REGRESSIONS
# ensuring Day is numeric (it is a factor in some of the earlier datasets)
region$Day <- as.numeric(as.character(region$Day))
x <- as.matrix(region[, c(1:6, 8)]) # predictors: every column except percent_active
y <- as.matrix(region[, 7])         # target: percent_active
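The model objects summarised below are not created in the visible code; judging from the printed Call, the linear model was presumably fit as follows:
# assumed fit: linear regression of percent_active on all other columns
fit <- lm(percent_active ~ ., data = region)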
end = "\n\n#############################################################\n\n\n"
cat("\n\nLinear Regression:\n------------------\n")
##
##
## Linear Regression:
## ------------------
##
## Call:
## lm(formula = percent_active ~ ., data = region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.142e-13 -5.385e-15 -1.048e-15 7.607e-15 5.031e-14
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+02 2.119e-14 4.718e+15 < 2e-16 ***
## Day 6.411e-15 1.560e-15 4.109e+00 0.000141 ***
## Confirmed -1.740e-18 9.716e-19 -1.791e+00 0.079073 .
## Deaths 2.854e-17 2.979e-17 9.580e-01 0.342505
## Recovered 1.375e-17 4.069e-18 3.379e+00 0.001385 **
## Active.Cases NA NA NA NA
## Closed.Cases NA NA NA NA
## percent_closed -1.000e+00 3.169e-15 -3.155e+14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061e-14 on 52 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.615e+31 on 5 and 52 DF, p-value: < 2.2e-16
# make predictions
predictions <- predict(fit, region)
# summarize accuracy
mse <- sqrt(mean((region$percent_active - predictions)^2))
cat("RMSE: ", mse)
## RMSE: 4.812696e-14
##
##
## #############################################################
##
##
## k-Nearest Neighbours:
## --------------------
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
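The kNN fit summarised above (and reused as fit in the prediction step below) is presumably created in a hidden chunk; a sketch, with k assumed to be 3 as used later in the document:
# assumed fit: k-nearest-neighbour regression on the predictor matrix x and target y (caret::knnreg)
fit <- knnreg(x, y, k = 3)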
# make predictions
predictions <- predict(fit, x)
# summarize accuracy
mse <- sqrt(mean((y - predictions)^2))
cat("RMSE: ", mse)
## RMSE: 0.5710865
##
##
## #############################################################
##
##
## Support Vector Machine:
## -----------------------
## Length Class Mode
## 1 ksvm S4
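Similarly, the ksvm object summarised above is presumably fit in a hidden chunk; a sketch, assuming kernlab's formula interface:
# assumed fit: kernel SVM regression of percent_active on the remaining columns
fit <- ksvm(percent_active ~ ., data = region)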
# make predictions
predictions <- predict(fit, region)
# summarize accuracy
mse <- sqrt(mean((y - predictions)^2))
cat("RMSE: ", mse)
## RMSE: 3.059589
From the summaries of the different regression algorithms above, we find that the Day column should be given priority.
As in the summary of the linear regression:

|             | Estimate  | Std. Error | t value   | Pr(>|t|)     |
|-------------|-----------|------------|-----------|--------------|
| (Intercept) | 1.000e+02 | 1.202e-14  | 8.319e+15 | < 2e-16 ***  |
| Day         | 5.221e-15 | 1.270e-15  | 4.112e+00 | 0.000140 *** |

- Admittedly, Active Cases(%) could be calculated directly if we knew Closed Cases(%) or the Confirmed/Deaths/Recovered figures,
- but for future dates we will not have any of those values, because knowing them would already amount to having the future estimate,
- and would make this whole analysis unnecessary.
- So we conclude that Day should be used to predict Active Cases(%).
# Visualizing the available data as a scatter plot to see how the Active Cases(%) in China has varied over days, since 22^nd^ January, 2020
# Day vs Active Cases(%)
region.scatter.plot <- ggplot(region, aes(x = Day, y = percent_active)) +
geom_point() +
labs( x = "Days", y = "Active Cases (%)") +
theme(
text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 20, face = "bold", hjust = 0.5)
,plot.subtitle = element_text(size = 25, family = "Courier", face = "bold", hjust = 0.5)
,axis.text = element_text(size = 12)
,axis.title = element_text(size = 20)
)
region.scatter.plot
From the above visualization, it is clear that, on average, the active case percentage has kept decreasing over the days.
Now we’ll split our China dataset into training (80%) & testing (20%) sets.
Data Splitting: train-test
set.seed(20) # generates the same random sample every time
training.samples <- region$Day %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- region[training.samples, ]
test.data <- region[-training.samples, ]
# Dimensions of the split datasets
dim(train.data)
## [1] 48 8
## [1] 10 8
## Day Confirmed Deaths Recovered Active.Cases Closed.Cases percent_active
## 1 1 548 17 28 503 45 91.78832
## 2 2 643 18 30 595 48 92.53499
## 3 3 920 26 36 858 62 93.26087
## percent_closed
## 1 8.211679
## 2 7.465008
## 3 6.739130
## Day Confirmed Deaths Recovered Active.Cases Closed.Cases percent_active
## 4 4 1406 42 39 1325 81 94.23898
## 13 13 19716 425 615 18676 1040 94.72510
## 15 15 27440 563 1115 25762 1678 93.88484
## percent_closed
## 4 5.761024
## 13 5.274904
## 15 6.115160
Selection of an algorithm
Now that our training & testing datasets are ready, we have to select the right regression algorithm to train our model.
For this, we will compare four regression algorithms, i.e.
- Support Vector Machines (kernel) Regression
- k-Nearest Neighbour Regression
- Linear Regression
- Polynomial Regression
What should be compared?
- Errors (should be minimized)
- Accuracy (should be maximized)
- R-Squared (R2) Value
- How well the model fits during testing (as seen in the graphs), etc.
We’ll be comparing these 4 regression algorithms, including SVMK and kNN, even though those two are mainly known for classification problems.
1. Support Vector Machine Regression
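The SVM fit itself (fit.svmk) and the predictions used in the performance table just below are not shown; a hedged sketch of how they were presumably produced on the training split, with Day as the chosen predictor:
# assumed fit and training-set predictions for the kernel SVM regression
fit.svmk <- ksvm(percent_active ~ Day, data = train.data)
predictions <- predict(fit.svmk, train.data)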
# Model performance
svm.predictions <- data.frame(
RMSE = RMSE(predictions, train.data$Day),
R2 = R2(predictions, train.data$Day)
)
svmk.trained = cbind( # Prediction for training data
train.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.svmk, train.data)
)
svmk.tested = cbind( # Prediction for tested data
test.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.svmk, test.data)
)
tail(svmk.trained, 5)
## Day percent_active Pridicted_percent_active
## 53 53 14.972153 17.26605
## 55 55 12.224649 13.81184
## 56 56 11.140171 12.97371
## 57 57 9.995931 13.09887
## 58 58 9.084860 14.37292
## Day percent_active Pridicted_percent_active
## 39 39 46.87610 49.87923
## 41 41 40.39133 43.38354
## 47 47 25.15992 28.07431
## 50 50 19.91572 23.13298
## 54 54 13.31185 15.34209
# Visualizations
svmk.trainer <- ggplot(train.data, aes(Day, percent_active) ) +
geom_point() +
geom_smooth(data=svmk.trained, method="loess", size=0) +
geom_line(data = svmk.trained, aes(Day, Pridicted_percent_active), color="#3366fe", size=1) + # SVM predictions
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTraining plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
svmk.tester <- ggplot(test.data, aes(Day, percent_active) ) +
geom_point() +
geom_smooth(data=svmk.tested, method="loess", size=0) +
geom_line(data = svmk.tested, aes(Day, Pridicted_percent_active), color="#3366fe", size=1) + # SVM predictions
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTesting plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
2. k-Nearest Neighbour Regression
x <- as.matrix(train.data[,1])
y <- as.matrix(train.data[,7])
# fit model
fit.knn <- knnreg(x, y, k=3)
#summary(fit.knn)
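A knn.predictions performance table is referenced later in the model comparison but is not computed in the visible code; presumably it mirrors the other models, e.g. (note that, like the SVM and linear-model blocks, this compares predictions against Day rather than percent_active, which would explain the large RMSE values in the comparison table):
# assumed performance table for the kNN fit on the training split
predictions <- predict(fit.knn, x)
knn.predictions <- data.frame(
  RMSE = RMSE(predictions, train.data$Day),
  R2 = R2(predictions, train.data$Day)
)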
knn.trained = cbind( # Prediction for training data
train.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.knn, as.matrix(train.data[,1]))
)
knn.tested = cbind( # Prediction for tested data
test.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.knn, as.matrix(test.data[,1]))
)
tail(knn.trained, 5)
## Day percent_active Pridicted_percent_active
## 53 53 14.972153 15.57997
## 55 55 12.224649 12.08323
## 56 56 11.140171 11.12025
## 57 57 9.995931 10.07365
## 58 58 9.084860 10.07365
## Day percent_active Pridicted_percent_active
## 39 39 46.87610 49.56913
## 41 41 40.39133 38.31875
## 47 47 25.15992 25.48533
## 50 50 19.91572 20.06390
## 54 54 13.31185 13.77505
# Visualizations
knn.trainer <- ggplot(train.data, aes(Day, percent_active) ) +
geom_point() +
geom_smooth(data=knn.trained, method="loess", size=0) + # training predictions (fixing the apparent copy-paste of knn.tested)
geom_line(data = knn.trained, aes(Day, Pridicted_percent_active), color="#3366fe", size=1) + # kNN predictions
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTraining plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
knn.tester <- ggplot(test.data, aes(Day, percent_active) ) +
geom_point() +
geom_smooth(data=knn.tested, method="loess", size=0) +
geom_line(data = knn.tested, aes(Day, Pridicted_percent_active), color="#3366fe", size=1) + # kNN predictions
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTesting plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
3. Linear Regression
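The linear-model object fit.lm and the predictions used below are presumably created in a hidden chunk; a sketch:
# assumed fit: simple linear regression of percent_active on Day over the training split
fit.lm <- lm(percent_active ~ Day, data = train.data)
predictions <- predict(fit.lm, train.data)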
# Model performance
lm.predictions <- data.frame(
RMSE = RMSE(predictions, train.data$Day),
R2 = R2(predictions, train.data$Day)
)
lm.trained = cbind( # Prediction for training data
train.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.lm, train.data)
)
lm.tested = cbind( # Prediction for tested data
test.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.lm, test.data)
)
tail(lm.trained, 5)
## Day percent_active Pridicted_percent_active
## 53 53 14.972153 54.83990
## 55 55 12.224649 56.27379
## 56 56 11.140171 56.83977
## 57 57 9.995931 57.43693
## 58 58 9.084860 57.91241
## Day percent_active Pridicted_percent_active
## 39 39 46.87610 38.18960
## 41 41 40.39133 41.57392
## 47 47 25.15992 49.52302
## 50 50 19.91572 52.25991
## 54 54 13.31185 55.70639
# Visualizations
lm.trainer <- ggplot(train.data, aes(Day, percent_active) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ x) + # linear function
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTraining plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
lm.tester <- ggplot(test.data, aes(Day, percent_active) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ x) + # linear function
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTesting plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
4. Polynomial regression
Checking for the BEST degree…
# Model performance
plm.predictions = data.frame(
Degree = NULL,
RMSE = NULL,
RSE = NULL,
R2 = NULL
)
for(deg in 1:20){
# building polynomial model
fit.plm = lm(percent_active ~ poly(Day, deg, raw = TRUE), data = train.data)
#summary(fit.plm)
#Residual Standard error (Like Standard Deviation)
k=length(fit.plm$coefficients)-1
#Multiple R-Squared (Coefficient of Determination)
SSyy=sum((train.data$percent_active-mean(train.data$percent_active))**2)
SSE=sum(fit.plm$residuals**2)
n=length(fit.plm$residuals)
# final
rmse = sqrt(SSE/(n-1))
rse = sqrt(SSE/(n-(1+k))) #Residual Standard Error
r2 = (SSyy-SSE)/SSyy
temp <- data.frame(
Degree = deg,
RMSE = rmse,
RSE = rse,
R2 = r2
)
plm.predictions = rbind(plm.predictions, temp)
}
plm.predictions
## Degree RMSE RSE R2
## 1 1 9.1213834 9.2199958 0.9131809
## 2 2 4.6663640 4.7689337 0.9772778
## 3 3 2.2460160 2.3213223 0.9947360
## 4 4 1.2378209 1.2941140 0.9984011
## 5 5 1.2360031 1.3075064 0.9984058
## 6 6 0.7801015 0.8352339 0.9993650
## 7 7 0.7221723 0.7828161 0.9994558
## 8 8 0.6092638 0.6688397 0.9996126
## 9 9 0.5836966 0.6491489 0.9996445
## 10 10 0.5823322 0.6563249 0.9996461
## 11 11 0.5819927 0.6649901 0.9996465
## 12 12 0.5591038 0.6478991 0.9996738
## 13 13 0.5591038 0.6573579 0.9996738
## 14 14 0.5386723 0.6428603 0.9996972
## 15 15 0.5386723 0.6528277 0.9996972
## 16 16 0.5051696 0.6220214 0.9997337
## 17 17 0.5051696 0.6323034 0.9997337
## 18 18 0.4912791 0.6254293 0.9997481
## 19 19 0.4912791 0.6364997 0.9997481
## 20 20 0.4912791 0.6481795 0.9997481
From the table above, at around Degree 16 the model attains an R-squared score of 0.999733, and the RMSE is also very low, so we’ll try taking degree = 16.
At Degree = 16
- R2 = 0.999733
- RMSE = 0.5051696
So, taking 16 as the degree of the polynomial regression:
# building polynomial model
deg = 16 # chosen degree (the search loop above leaves deg at 20, so it must be reset here)
fit.plm = lm(percent_active ~ poly(Day, deg, raw = TRUE), data = train.data)
#summary(fit.plm)
plm.trained = cbind( # Prediction for training data
train.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.plm, train.data) # predicting the values
)
plm.tested = cbind( # Prediction for tested data
test.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.plm, test.data) # predicting the values
)
tail(plm.trained, 5)
## Day percent_active Pridicted_percent_active
## 53 53 14.972153 14.966792
## 55 55 12.224649 12.244624
## 56 56 11.140171 11.129915
## 57 57 9.995931 10.013176
## 58 58 9.084860 9.076924
## Day percent_active Pridicted_percent_active
## 39 39 46.87610 47.44498
## 41 41 40.39133 40.45985
## 47 47 25.15992 25.26067
## 50 50 19.91572 20.17564
## 54 54 13.31185 13.49354
# Visualizations of the predictions for Training as well as Testing datasets to see fit
plm.trainer <- ggplot(train.data, aes(Day, percent_active) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, deg, raw = TRUE)) + # polynomial function
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTraining plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
plm.tester <- ggplot(test.data, aes(Day, percent_active) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, deg, raw = TRUE)) + # polynomial function
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTesting plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
So we can see that our model fits best as a polynomial regression with degree = 16.
We already have the actual data for Day 59, i.e. Active Cases(%) = 8.285436.
To finalize the model, we will perform one last test by predicting the Active Cases(%) for Days 59, 60 and 61.
Creating a temporary dataset for this prediction
# Available data for 59th day (i.e. 20-03-2020)
temp.test = data.frame(
Day = 59:61,
Confirmed = c(81251, NaN, NaN),
Deaths = c(3253, NaN, NaN),
Recovered = c(71266, NaN, NaN),
Active.Cases = c(6732, NaN, NaN),
Closed.Cases = c(74519, NaN, NaN),
percent_active = c(8.285436, NaN, NaN),
percent_closed = c(91.7145, NaN, NaN)
)
temp.test
## Day Confirmed Deaths Recovered Active.Cases Closed.Cases percent_active
## 1 59 81251 3253 71266 6732 74519 8.285436
## 2 60 NaN NaN NaN NaN NaN NaN
## 3 61 NaN NaN NaN NaN NaN NaN
## percent_closed
## 1 91.7145
## 2 NaN
## 3 NaN
# Now, we'll find the Active case(%) from all the four algorithms
temp.required <- temp.test[,c(1, 7)]
temp.required$"Polynomial Model" <- predict(fit.plm, temp.test)
## Day percent_active Polynomial Model
## 1 59 8.285436 9.781559
## 2 60 NaN 16.898091
## 3 61 NaN 42.345397
Here we see a sharp rise in the predicted values of Active Cases(%), which seems very strange because we already know that China has been very successful in controlling this epidemic.
* It means that although our model fits best in this scenario, there is some problem behind this sudden rise in the predictions.
Then why does our model fit so perfectly?
- Here comes the term: Model Overfitting!
- Because we have trained our model with such a high degree, it reproduces the training values almost exactly.
- That’s why our model is very accurate on the training and testing data (whose independent variable lies in the same domain, i.e. [1, 58]),
- yet, whenever it gets a new value (60, 61, …), our model fails to give the right prediction because of overfitting.
In overfitting, a model works with almost 100% accuracy on the data it has seen, but fails to give the right output on unseen data.
Hence, we’ll have to choose some other degree.
From the degree-accuracy table (generated earlier), let’s have a look at Degree = 11:
## Degree RMSE RSE R2
## 11 11 0.5819927 0.6649901 0.9996465
It seems a little better, so we’ll choose 11 as our degree of polynomial regression.
# building polynomial model
deg = 11 # switching to the lower degree chosen above
fit.plm = lm(percent_active ~ poly(Day, deg, raw = TRUE), data = train.data)
#summary(fit.plm)
########################
## Prediction table
plm.trained = cbind( # Prediction for training data
train.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.plm, train.data) # predicting the values
)
plm.tested = cbind( # Prediction for tested data
test.data[,c("Day", "percent_active")],
Pridicted_percent_active = predict(fit.plm, test.data) # predicting the values
)
########################
# Graphs for prediction
plm.trainer <- ggplot(train.data, aes(Day, percent_active) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, deg, raw = TRUE)) + # polynomial function
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTraining plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
plm.tester <- ggplot(test.data, aes(Day, percent_active) ) +
geom_point() +
stat_smooth(method = lm, formula = y ~ poly(x, deg, raw = TRUE)) + # polynomial function
# decoration
labs( x = "Days", y = "Active Cases (%)", title = paste("\nTesting plot", rName, sep = " - ") ) +
theme( plot.title = element_text(size = 20, face = "bold"))
## Day percent_active Pridicted_percent_active
## 53 53 14.972153 15.369157
## 55 55 12.224649 12.339328
## 56 56 11.140171 10.925893
## 57 57 9.995931 9.793102
## 58 58 9.084860 9.238991
## Day percent_active Pridicted_percent_active
## 39 39 46.87610 47.21327
## 41 41 40.39133 40.63074
## 47 47 25.15992 24.96403
## 50 50 19.91572 19.83759
## 54 54 13.31185 13.85174
## Day percent_active Polynomial Model
## 1 59 8.285436 9.710642
## 2 60 NaN 11.840712
## 3 61 NaN 16.487469
So, now it’s much better!
Evaluation:
Understanding R2 & RMSE
R vs R2
R:
- Correlation b/w independent & dependent variables
- Varies between [-1, 1]
- Measures the direction & strength of the linear relation between two quantitative variables (x, y)
R2:
- It is simply R × R
- The squared correlation between x (independent var.) & y (dependent var.)
- Varies between [0, 1]
- Measures how close the data points are to the regression line, i.e. the goodness of fit, which in turn indicates how good/correct the predictions are.
The closer R2 gets to 1, the more accurate the predicted values become.
RMSE
- This is:
- Root Mean Squared Error or
- Root Mean Squared Deviation or
- the Standard Deviation of the Residuals
- It measures how far the actual data-points lie from the regression line.
The deviation between the predicted & actual data points decreases as the RMSE value decreases.
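In R, both quantities can be computed directly from the actual and predicted values; a minimal illustrative sketch (the helper names are hypothetical):
# RMSE: root of the mean squared difference between actual and predicted values
rmse.manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# R2: squared correlation between actual and predicted values
r2.manual <- function(actual, predicted) cor(actual, predicted)^2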
* Comparison: Among Algorithms
# Let's evaluate and compare the 4 model algorithms
# Performance (Accuracy) table
Accuracy.Table = data.frame(
Algo = c("Support Vector Machine Regression",
"k-Nearest Neighbour Regression",
"Linear Regression",
"Polynomial Regression"),
RMSE = c(svm.predictions$RMSE[1],
knn.predictions$RMSE[1],
lm.predictions$RMSE[1],
plm.predictions[deg, 'RMSE']),
R2 = c(svm.predictions$R2[1],
knn.predictions$R2[1],
lm.predictions$R2[1],
plm.predictions[deg, 'R2'])
)
Accuracy.Table
## Algo RMSE R2
## 1 Support Vector Machine Regression 56.3856139 0.9164658
## 2 k-Nearest Neighbour Regression 58.3887824 0.9141718
## 3 Linear Regression 4.9293341 0.9131809
## 4 Polynomial Regression 0.5819927 0.9996465
# Creating a comparison table to compare the predicted values by all four models
Prediction = cbind(
"Day" = test.data$Day,
"Actual Active Case(%)" = test.data$percent_active,
"Predicted by SVMK" = svmk.tested$Pridicted_percent_active,
"Predicted by kNN" = knn.tested$Pridicted_percent_active,
"Predicted by LM" = lm.tested$Pridicted_percent_active,
"Predicted by Poly LM" = plm.tested$Pridicted_percent_active
)
Prediction
## Day Actual Active Case(%) Predicted by SVMK Predicted by kNN
## [1,] 4 94.23898 92.20595 93.96736
## [2,] 13 94.72510 91.83658 94.97576
## [3,] 15 93.88484 90.64834 93.16354
## [4,] 18 90.75623 88.73618 90.77715
## [5,] 28 78.15822 75.18151 77.81573
## [6,] 39 46.87610 49.87923 49.56913
## [7,] 41 40.39133 43.38354 38.31875
## [8,] 47 25.15992 28.07431 25.48533
## [9,] 50 19.91572 23.13298 20.06390
## [10,] 54 13.31185 15.34209 13.77505
## Predicted by LM Predicted by Poly LM
## [1,] 13.47147 94.08485
## [2,] 13.21776 94.64569
## [3,] 13.65628 93.36389
## [4,] 15.28907 90.96662
## [5,] 21.86382 78.35243
## [6,] 38.18960 47.21327
## [7,] 41.57392 40.63074
## [8,] 49.52302 24.96403
## [9,] 52.25991 19.83759
## [10,] 55.70639 13.85174
* Finalized Algorithm
Among these four algorithms, our study leads us to choose the Polynomial Regression for modelling.
It is so because:
- Based on all four models’ performance, it has:
- Comparatively better Root Mean Squared Error (RMSE) and R-Squared (R2) values
- Comparatively better predictions
- In general:
- Polynomial regression provides a good approximation of the relationship between the dependent and independent variables.
- A broad range of functions can be fit with it.
- Polynomial regression can fit a wide range of curvatures and hence gives a better fit.
We have NOT chosen kNN-Regression because:
- It is a comparatively complex algorithm
- kNN usually works better for classification, because it bases its prediction mainly on the values of its nearest neighbours.
# Appending 59th day's data to our training dataset
train.data = rbind(train.data, temp.test[1,]) # RUN ONCE!!
## Day Confirmed Deaths Recovered Active.Cases Closed.Cases percent_active
## 44 53 80977 3193 65660 12124 68853 14.972153
## 45 55 81033 3217 67910 9906 71127 12.224649
## 46 56 81058 3230 68798 9030 72028 11.140171
## 47 57 81103 3241 69755 8107 72996 9.995931
## 48 58 81157 3249 70535 7373 73784 9.084860
## 49 59 81251 3253 71266 6732 74519 8.285436
## percent_closed
## 44 85.02785
## 45 87.77535
## 46 88.85983
## 47 90.00407
## 48 90.91514
## 49 91.71450
# training once more on updated training dataset
fit.plm = lm(percent_active ~ poly(Day, deg, raw = TRUE), data = train.data)
# At this point, we can clearly see that we should choose between the Polynomial Regression and kNN algorithms.
# But we also know that Polynomial Regression fits the trend better than kNN, as per the visualizations above.
# So, we find that Polynomial Regression is the best choice.
# Hence, we'll be using Polynomial Regression with Degree = 11 (the degree finally selected above).
# Now we are ready for the last step of our COVID-19 analysis, i.e. Deployment, to get the status of China over the next few days:
Deployment:
* Summary
We have trained our model by Polynomial Regression Algorithm at Degree: 11
##
## Call:
## lm(formula = percent_active ~ poly(Day, deg, raw = TRUE), data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.95787 -0.18752 0.02621 0.30405 1.48238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.194e+01 2.239e+00 41.071 <2e-16 ***
## poly(Day, deg, raw = TRUE)1 -5.966e-01 2.646e+00 -0.226 0.823
## poly(Day, deg, raw = TRUE)2 6.316e-01 1.065e+00 0.593 0.557
## poly(Day, deg, raw = TRUE)3 -1.298e-01 2.102e-01 -0.618 0.541
## poly(Day, deg, raw = TRUE)4 1.396e-02 2.373e-02 0.588 0.560
## poly(Day, deg, raw = TRUE)5 -9.612e-04 1.661e-03 -0.579 0.566
## poly(Day, deg, raw = TRUE)6 4.437e-05 7.521e-05 0.590 0.559
## poly(Day, deg, raw = TRUE)7 -1.374e-06 2.243e-06 -0.613 0.544
## poly(Day, deg, raw = TRUE)8 2.786e-08 4.370e-08 0.638 0.528
## poly(Day, deg, raw = TRUE)9 -3.523e-10 5.353e-10 -0.658 0.515
## poly(Day, deg, raw = TRUE)10 2.507e-12 3.738e-12 0.671 0.507
## poly(Day, deg, raw = TRUE)11 -7.650e-15 1.135e-14 -0.674 0.504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6593 on 37 degrees of freedom
## Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996
## F-statistic: 1.005e+04 on 11 and 37 DF, p-value: < 2.2e-16
Accuracy
## Degree RMSE RSE R2
## 11 11 0.5819927 0.6649901 0.9996465
That was the overview of our model!
Now we are ready to estimate China’s Active Cases(%) status for the coming week.
# Preparing our dataset to store next 8 days' estimate
d = as.Date("21/03/2020", format(c("%d/%m/%Y")))
date = as.character((c(1:8) + d), format(c("%d/%m/%Y")))
finalEstimate <- data.frame(
Day = 59:66,
Date = date
)
Predicting the estimated Active Case percentage for the next 7-8 days:
Finally! We have estimated the possible % of Active Cases for the next 7-8 days for China.
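The estimates below are presumably obtained by feeding these Day values into the finalized degree-11 polynomial model, e.g.:
# assumed prediction step for days 59-66
finalEstimate$'Estimated Active Case(%)' = predict(fit.plm, finalEstimate)
finalEstimate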
Viewing the result:
## Day Date Estimated Active Case(%)
## 1 59 22/03/2020 8.400352
## 2 60 23/03/2020 8.311558
## 3 61 24/03/2020 8.699806
## 4 62 25/03/2020 9.441795
## 5 63 26/03/2020 10.174679
## 6 64 27/03/2020 10.159197
## 7 65 28/03/2020 8.096655
## 8 66 29/03/2020 1.889395
Conclusion:
So, from this whole analysis, we arrive at the conclusion that:
- By the end of March 2020 (around 28th March), China will no longer have to worry much about new cases.
- China should focus on taking precautions in the most sensitive areas in order to ensure there are no new COVID-19 cases.
- They are NOT likely to report a large number of new cases by the end of March.
- China should focus more on the treatment of the current patients.
- China can reasonably lift the lockdown from its locked-down provinces after March.
As per the sources, we have indeed witnessed that:
- 25th March: Hubei lifted the lockdown outside Wuhan, the most affected area in the whole of China.
- China started attempts to lift the lockdown by 28th-29th March [The Guardian].
- 8th April: China lifted the 76-day lockdown in Wuhan. [source]
Now, far more COVID-19 cases are being closed than are being newly reported.
This number is expected to decrease even further if China keeps focusing on precautions.
The following word-cloud visualization shows how rapidly China is overcoming COVID-19.