Talking about mortality during this worldwide pandemic is not easy, but there are questions to ask. In an ideal world where people have equal access to medical resources, death after a COVID infection would be purely a roll of God's dice, and we would expect to see equal death rates across regions. Note that the death rate is different from both the number of infections and mortality. The number of infections can be influenced by factors outside the medical system, for example how the government enforces regulations to control the spread of the disease. Mortality is the product of the number of infections and the death rate, and is thus under the influence of the same factors. The death rate, on the other hand, depends mostly on access to medical resources: after one gets infected, how promptly and properly can they get treated?
Death rates so far differ wildly across regions of the US. The real-time COVID monitor built by 1point3acres provides county-level statistics. These numbers can be linked with demographic information from the Census Bureau to enable some analyses. Some demographic features can presumably be correlated with death rates. An example is the proportion of aged people. Politicians tell us that COVID is more fatal to old people than to young people. The following figure compares the county-level death rate with the proportion of the population older than 60. The positive correlation seems to hold when the proportion stays below 35%, but becomes less pronounced when the proportion exceeds that.
Another example is the proportion of the population with health insurance. There are two possible patterns. If infected people need to rely on their own health insurance for treatment, we expect to see a negative correlation between health insurance coverage and death rate. Alternatively, if the government strongly subsidizes the health system to battle this pandemic, infected people can be treated regardless of their health insurance status, which means no such correlation would exist. Our data seem to support the former.
A more disturbing example is the median household income. We hope to see no correlation between income levels and COVID death rates, assuming that in an ideal world medical resources are deployed to all regions independent of their economic status. The data, however, show a negative correlation between the two, and the correlation even seems to be stronger than the correlation with the proportion of aged population. Death rates are mostly above 3% in counties with median household income below 50K, and mostly below 3% in counties with median household income above 100K.
Features can have confounding effects. The proportion of aged population can be correlated with median household income. Once we control for other factors, what is the effect of median household income? To answer such questions, I run a Generalized Linear Model with a binomial distribution including the aforementioned three variables. Backward selection is used to exclude insignificant variables. The proportion of the population older than 60 is the only variable that is excluded. The model results are shown in the following table. To interpret, every one-percentage-point increase in health insurance coverage is associated with a decrease of about 0.044 in the log odds of COVID death (roughly a 4.3% decrease in the odds), and every $10K increase in median household income is associated with a decrease of about 0.034 in the log odds (roughly a 3.3% decrease in the odds).
Variable | Coefficient Estimate | P-value |
Intercept | -2.700e+00 | < 2e-16 *** |
Median Household Income | -3.350e-06 | 0.000164 *** |
Proportion of Population with Health Insurance | -4.444e+00 | 2.5e-06 *** |
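The fitting code itself is not reproduced in this post. As a rough illustration of what a binomial GLM on the log-odds scale does, here is a minimal pure-Python sketch with an intercept and a single predictor, fit by Newton-Raphson. The function name `fit_binomial_glm` and the synthetic numbers are assumptions made for this example; they are not the data or the code behind the table above.

```python
import math

def fit_binomial_glm(x, deaths, cases, max_iter=50, tol=1e-10):
    """Fit logit(p_i) = b0 + b1 * x_i to grouped binomial data by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0  # score vector and information matrix
        for xi, yi, ni in zip(x, deaths, cases):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)   # binomial variance weight
            r = yi - ni * p          # observed minus expected deaths
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (information matrix) * step = score.
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1

# Synthetic check (made-up values): expected deaths are generated from known
# coefficients, so the fit should recover them.
true_b0, true_b1 = -4.0, 2.0
shares = [0.1, 0.2, 0.3, 0.4, 0.5]
cases = [1000, 2000, 1500, 800, 1200]
deaths = [n / (1.0 + math.exp(-(true_b0 + true_b1 * s)))
          for n, s in zip(cases, shares)]
b0, b1 = fit_binomial_glm(shares, deaths, cases)
```

A real analysis with several predictors and backward selection would more likely rely on a statistics package than on hand-written Newton steps; the sketch only shows the mechanics for one variable.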
Correlation is not causation. Factors outside this model may play a role in reality. For example, maybe counties with higher income also have a better educated population, which is therefore better informed and takes action earlier when symptoms develop. To improve the model in this direction, I include more demographic features. The next model I build includes the following features:
The model results are in the following table:
Variable | Coefficient Estimate | P-value |
Intercept | -3.8715 | < 2e-16 *** |
Proportion of Population Older than 60 | 2.0019 | 2.34e-05 *** |
Proportion of Population with College Degree | -1.1698 | 3.37e-08 *** |
Proportion of Families with Two or More Races | -3.4262 | 0.0533 . |
Proportion of Black or African American Families | 0.6147 | 6.11e-08 *** |
Proportion of Asian Families | 2.6453 | 3.85e-13 *** |
Proportion of Families with Some Other Races | -1.5529 | 6.83e-05 *** |
Proportion of Hawaiian or Other Pacific Islander Families | -11.2095 | 0.0966 . |
The results are drastically different from the first model. After introducing additional demographic variables, both median household income and health insurance coverage are no longer significant. Rather, the most significant factor is the proportion of Asian families. Controlling for all other factors in this model, every one-percentage-point increase in the proportion of Asian families is associated with an increase of about 0.026 in the log odds of COVID death, roughly a 2.7% increase in the odds. This effect is even larger than that of the proportion of aged population. Another group that fares worse than others is Black or African American families: every one-percentage-point increase in their proportion is associated with an increase of about 0.006 in the log odds, roughly a 0.6% increase in the odds.
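Because the model works on the log-odds scale, a coefficient translates into a multiplicative change in the odds of death, not directly into percentage points of the death rate. The sketch below converts coefficients copied from the table above into percent changes in the odds for a one-percentage-point increase in a predictor. It assumes the predictors are coded as proportions between 0 and 1, which the size of the coefficients suggests but the post does not state; the helper name `odds_change_percent` is made up for this example.

```python
import math

def odds_change_percent(coef, delta):
    """Percent change in the odds when a predictor increases by `delta`."""
    return (math.exp(coef * delta) - 1.0) * 100.0

# Coefficients from the second model's table; delta = 0.01 is one percentage
# point under the assumption that predictors are proportions in [0, 1].
asian = odds_change_percent(2.6453, 0.01)
black = odds_change_percent(0.6147, 0.01)
over_60 = odds_change_percent(2.0019, 0.01)
```

When death rates are only a few percent, the odds and the rate are close, so these figures are also a reasonable approximation of the relative change in the death rate itself.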
How well does this model fit? The following figure shows the log ratio of the actual death rate vs. the predicted death rate.
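For reference, the log ratio is zero when the prediction matches the actual rate, positive when the model under-predicts, and negative when it over-predicts. A minimal sketch of the computation follows; skipping counties with a zero rate is an assumption made here so that the logarithm is defined, and the post does not say how such counties are handled.

```python
import math

def log_ratios(actual, predicted):
    """Return log(actual / predicted) for pairs where both rates are positive."""
    return [math.log(a / p)
            for a, p in zip(actual, predicted)
            if a > 0 and p > 0]

# Made-up rates for three counties; the second is skipped (zero actual rate).
ratios = log_ratios([0.02, 0.0, 0.04], [0.02, 0.03, 0.02])
```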
My model clearly has many shortcomings, including but not limited to:
For counties with at least 1K infections, this figure shows how the predicted death rate differs from the actual death rate. Food for thought:
TECHNICAL DETAILS
CENSUS DATA
I use the 2018 American Community Survey 5-year Estimate data for county-level demographics in the US. Information on how to download the data can be found on the Census Bureau's developer website.
COVID DATA
I manually copied data from 1point3acres Live-time Report snapshot on 4/4/2020 at 2pm EDT.
DATA ANALYSIS
The final model includes 2238 US counties where I can match COVID data with the ACS survey data.
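The matching step can be pictured as an inner join on state and county name. The sketch below uses plain dictionaries; the field names and the census value are placeholders invented for illustration, while the two COVID rows reuse counts from the table of excluded counties further down. Counties without a census match are dropped.

```python
def join_counties(covid_rows, census_rows):
    """Inner-join COVID counts with census features on (state, county)."""
    census_by_key = {(r["state"], r["county"]): r for r in census_rows}
    merged = []
    for row in covid_rows:
        census = census_by_key.get((row["state"], row["county"]))
        if census is None or row["infected"] == 0:
            continue  # drop unmatched counties and avoid dividing by zero
        merged.append({**census, **row,
                       "death_rate": row["deaths"] / row["infected"]})
    return merged

covid = [
    {"state": "Washington", "county": "King", "infected": 2787, "deaths": 188},
    {"state": "New York", "county": "Nassau", "infected": 13346, "deaths": 95},
]
# Placeholder census value, invented for illustration only.
census = [
    {"state": "Washington", "county": "King", "share_over_60": 0.2},
]
matched = join_counties(covid, census)
```

In practice county names are spelled inconsistently across sources, so a join on a numeric county code would probably be more robust than a join on names.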
Six counties are excluded from model fitting because they have high influence according to Cook's distance. These counties are shown in the table below. Admittedly, they are the counties most infected by COVID in the US, but they also show a quite distinct pattern compared with other US counties. The model excludes them in order to analyze the pattern for US counties in general.
State | County | Infected | Death |
Florida | Miami-Dade | 3667 | 30 |
New York | Nassau | 13346 | 95 |
New York | New York | 63306 | 1905 |
New York | Suffolk | 11370 | 96 |
New York | Westchester | 13081 | 67 |
Washington | King | 2787 | 188 |
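To give a feel for what Cook's distance measures, here is a minimal sketch for an ordinary least-squares line with one predictor. This is a deliberate simplification: the analysis above applies the diagnostic to a binomial GLM, and the function name and numbers below are made up for illustration. Each point's distance combines its residual with its leverage, so a point far from the others in x can dominate the fit even when its residual looks small.

```python
def cooks_distance(x, y):
    """Cook's distance of each point for the simple regression y = a + b * x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)                 # residual variance
    lev = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]   # leverage h_i
    return [e * e / (p * s2) * h / (1.0 - h) ** 2
            for e, h in zip(resid, lev)]

# Made-up data: the last point sits far from the rest and pulls the line.
distances = cooks_distance([1, 2, 3, 4, 5, 20],
                           [1.1, 1.9, 3.2, 3.9, 5.1, 8.0])
most_influential = distances.index(max(distances))
```

A common rule of thumb flags points whose distance exceeds 4 divided by the number of observations; whether that or another cutoff was used here is not stated.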
All code can be found here.
(c)2017-2026 CHANDLER ZUO ALL RIGHTS PRESERVED