Intro

Hey Joel, for this homework I went to the datasets subreddit (it comes in handy a lot in this class) and found a pinned thread on r/datasets related to the coronavirus. From there, I followed the steps given in module 2.2 to pull the table from worldometers.info into a data frame in R. Once I had the data loaded, I realized there was a lot of missing data. First, I replaced all missing values with NA. Then I deleted the first 9 rows because they were all continental totals and I only wanted countries. After that I added some dummy variables and started creating visualizations.
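The cleaning steps described above can be sketched in base R roughly like this. It is only an illustration on a toy table: the column names (`Country`, `TotalDeaths`), the `"N/A"` marker, and the two aggregate rows (standing in for the 9 continental rows) are assumptions, not the actual scraped data.

```r
# Toy stand-in for the scraped table; column names and the "N/A" marker are assumptions.
raw <- data.frame(
  Country     = c("North America", "Asia", "USA", "China", "Spain"),
  TotalDeaths = c("90,000", "", "86,912", "4,633", "N/A"),
  stringsAsFactors = FALSE
)

# Treat empty strings and "N/A" as missing values.
raw[raw == "" | raw == "N/A"] <- NA

# Drop the leading aggregate rows (9 in the real table, 2 in this toy one).
n_aggregate <- 2
countries <- raw[-seq_len(n_aggregate), ]

# Strip thousands separators so the counts can be used as numbers.
countries$TotalDeaths <- as.numeric(gsub(",", "", countries$TotalDeaths))

print(countries)
```

Converting the counts to numeric right after cleaning matters because scraped tables usually arrive as text, and functions like `mean()` will not work on character columns.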

Once I started playing with the data, I realized there were some oddities surrounding China. So the question I asked myself was: what does the rest of the world’s data look like compared to China, the place where COVID started?

This is me creating a new data table that is filtered to include only countries with populations over 20 million.
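A minimal sketch of that filter in base R, assuming a `Population` column (the column name and the third row are made up for illustration; the USA and China populations are the ones quoted later in this write-up):

```r
# Toy data; "SmallCountry" and its population are hypothetical.
covid <- data.frame(
  Country    = c("USA", "China", "SmallCountry"),
  Population = c(330753490, 1439323776, 500000),
  stringsAsFactors = FALSE
)

# Keep only countries with more than 20 million people.
large <- subset(covid, Population > 20e6)

print(large$Country)
```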

This is me creating a dummy variable to flag which countries are above and below the average number of deaths.

## [1] 859.6957
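For reference, a dummy variable like this can be built in base R along the following lines. The data and the names `TotalDeaths` and `AboveAvg` are hypothetical, so the average here will not match the value printed above.

```r
# Hypothetical death counts, including one missing value.
covid <- data.frame(
  Country     = c("A", "B", "C", "D"),
  TotalDeaths = c(100, 2000, 900, NA),
  stringsAsFactors = FALSE
)

# Average deaths, ignoring missing values.
avg_deaths <- mean(covid$TotalDeaths, na.rm = TRUE)

# 1 = above average, 0 = at or below average, NA = deaths unknown.
covid$AboveAvg <- as.integer(covid$TotalDeaths > avg_deaths)

# Split into the two groups; subset() drops rows where the flag is NA.
above <- subset(covid, AboveAvg == 1)
below <- subset(covid, AboveAvg == 0)
```

Without `na.rm = TRUE`, a single missing death count would make the average itself `NA`, and every comparison against it would be `NA` as well.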

This is me creating a new data table that is filtered to contain only countries that are above the average number of deaths.

I don’t know what’s going on here. If you check my Rmd file or R file, my visualization is completely different. When I knit the file, the output changes? This graph only shows two bars when there should be 10. WTF.

Filtered data table for countries with below-average deaths.

This data is very interesting. Looking at the graph of below-average deaths, we see that China, a country with a population of 1,439,323,776 and ground zero for the virus, has only 4,633 reported deaths. If you factor in how rural some parts of China are, and that rural areas tend to have less access to healthcare and other urban amenities, this is astounding.

This gets even weirder if you look at the USA’s data. The USA has a population of only 330,753,490 and 86,912 reported deaths, more than 18 times as many as China. China, the place where this all started, where there is a lot less access to healthcare... for a virus that has no cure. Either Eastern medicine is the real deal, or the Chinese government might just be lying to us.
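The comparison is even starker per capita. Here is the arithmetic in R, using only the population and death figures quoted above (the variable names are just for illustration):

```r
# Figures quoted in the text above.
us_deaths <- 86912
us_pop    <- 330753490
cn_deaths <- 4633
cn_pop    <- 1439323776

# Raw ratio of reported deaths: roughly 18.8.
raw_ratio <- us_deaths / cn_deaths

# Deaths per million people: roughly 263 for the USA, 3.2 for China.
us_per_million <- us_deaths / us_pop * 1e6
cn_per_million <- cn_deaths / cn_pop * 1e6

# Per-capita ratio: roughly 82.
per_capita_ratio <- us_per_million / cn_per_million

print(round(c(raw_ratio, us_per_million, cn_per_million, per_capita_ratio), 1))
```

So on a per-person basis the reported US death rate is about 82 times China's, not just 18 times.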

Filtered data table showing countries in Asia with a large population.

Data table showing only countries in Asia.

## Warning: Removed 13 row(s) containing missing values (geom_path).

I made this graph to see what the correlation might be between the number of active cases in Asia (for countries with a population over 20,000,000) and the number of patients in serious or critical condition. The correlation seems to be that the more patients a country has in critical condition, the fewer active cases it has. One could assume from a psychological standpoint that if the virus is killing more of your peers, it would incentivize you to practice social distancing more, thus leading to fewer active cases.
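To put a number on a relationship like this rather than reading it off the graph, one option is the Pearson correlation coefficient. The sketch below uses made-up values and assumed column names (`ActiveCases`, `SeriousCritical`); it only shows the mechanics, including how to handle the kind of missing values the warning above refers to.

```r
# Made-up values for five hypothetical countries; one active-case count is missing.
asia <- data.frame(
  ActiveCases     = c(5000, 3000, 1200, NA, 400),
  SeriousCritical = c(10, 40, 80, 25, 150)
)

# Pearson correlation, using only rows where both values are present.
r <- cor(asia$ActiveCases, asia$SeriousCritical, use = "complete.obs")

print(r)
```

A value near -1 would support the "more critical patients, fewer active cases" reading, while a value near 0 would mean the pattern in the plot is weak. Either way, a correlation alone does not show that one thing causes the other.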

This visualization is screwed up too. I don’t know why the graphs look fine in my Rmd and R files but have fewer observations when I knit them. WTF.

From this graph, we can see which countries have the most recovered patients. The US has the most recovered patients, but we have to remember that it also has the most cases by a lot. The US has over 1,000,000 cases and the next closest is Spain with 272,646 cases. Yikes. But from this, we can see that there is a correlation between how many patients recover and how many cases a country has. Then there is China, whose numbers don’t make any sense for its population.

Lastly, I should probably discuss which statistical or analytical method I would use. I would use a predictive method, possibly a decision tree or logistic regression, to see what the data would look like in the near future.
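As a rough idea of what the logistic regression option could look like: logistic regression predicts a yes/no outcome, so one natural target would be an above-average-deaths dummy. Everything in this sketch is an assumption for illustration: the data are synthetic, and using population (in millions) as the single predictor is just one possible choice.

```r
# Synthetic data: population in millions and a 0/1 above-average-deaths flag.
toy <- data.frame(
  PopMillions = c(5, 10, 20, 30, 40, 50, 60, 80, 100, 150),
  AboveAvg    = c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1)
)

# Fit a logistic regression of the flag on population.
fit <- glm(AboveAvg ~ PopMillions, data = toy, family = binomial)

# Predicted probability of being above average for a hypothetical
# country with 70 million people.
p <- predict(fit, newdata = data.frame(PopMillions = 70), type = "response")

print(coef(fit))
print(p)
```

A decision tree would be fit in a similar way with a package such as `rpart`. Forecasting case counts over time would need a time series of daily figures, which a single snapshot of the table does not provide.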