SHIP/KEF Guidelines for Conducting Quality-Assured Analyses


Sebastian E. Baumeister, Carsten-Oliver Schmidt, Till Ittermann, Henry Völzke
Version

SHIP/KEF Guidelines for Conducting Quality-Assured Analyses

1. Analysis projects generally require an approved data use application.
2. An analysis plan must be drawn up before the analyses begin. It must be agreed at least between the first, second and last author.
3. Variable definitions and statistical analyses must be documented completely in programs (e.g. Stata do-files and log files). These programs must be sent to the head of the publication committee before the manuscript is first submitted (see the appendix with example analyses).
4. The log files, which contain all variable definitions and statistical analyses, must be checked by a second person. In particular, this involves cross-checking them against the tables of the manuscript.
5. The analyses must be carried out appropriately. This comprises using adequate statistical methods and checking the assumptions of the methods used [1,2,3]. If the assumptions are substantially violated, the analyses must be adapted accordingly. The assumption checks must likewise be documented in the do-file and log file.

Appendix

General notes

In most analysis projects, the results of regression models are the main focus. When estimating regression models, make sure that the underlying assumptions are met [1,2,3]. Substantial violations of these assumptions can bias the estimates (coefficients, p-values, confidence intervals, etc.). Important assumptions of classical regression models include:

a. Normally distributed residuals
b. Linearity
c. Homoscedasticity
d. Absence of unduly influential observations
e. Independence of the residuals
f. Missing completely at random (MCAR) for missing values and dropouts

Notes on covariate selection

Including and selecting covariates in a regression model can serve different goals. The basic distinction is between confounder models and prediction models [1,2,3]. In a confounder model, the association between exposure and outcome is the focus. The confounders to adjust for are selected on theoretical grounds (e.g. using causal graphs) or by empirical criteria (e.g. a 15% change in coefficient) [4]. Prediction models [3] are developed using theory-driven selection and empirical, automated selection (such as stepwise procedures) [3].

References:
1. Vittinghoff et al. 2005. Regression Methods in Biostatistics.
2. Harrell 2001. Regression Modeling Strategies.
3. Steyerberg 2010. Clinical Prediction Models.
4. Rothman, Greenland, Lash 2008. Modern Epidemiology.

Example syntax and log file for running and checking a regression analysis

The following example illustrates do-files and log files. It analyses correlates of birth weight using linear regression.

1. Log file for variable definition

** Sebastian Baumeister
** Title of project: Maternal correlates of birth weight
** Path: E:\Arbeit\Greifswald\ship_statistik\Richtlinie zur Durchführung von Auswertungen\
glo bw "E:\Arbeit\Greifswald\ship_statistik\Richtlinie zur Durchführung von Auswertungen\data\"

* Load original dataset and save under new name
********************
use clear
(Hosmer & Lemeshow data)
save "$bw\bw_smoking.dta", replace
file E:\Arbeit\Greifswald\ship_statistik\Richtlinie zur Durchführung von Auswertungen\data\\bw_smoking.dta saved
codebook, c

Variable   Label
id         identification code
low        birth weight<2500g
age        age of mother
lwt        weight at last menstrual period
race       race
smoke      smoked during pregnancy
ptl        premature labor history (count)
ht         has history of hypertension
ui         presence, uterine irritability
ftv        number of visits to physician during 1st trimester
bwt        birth weight (grams)
white      race==white
black      race==black
other      race==other
ptd
lwd
[The Obs, Unique, Mean, Min and Max columns are not preserved in this transcript.]

* Define new outcome variables
************
* Low birth weight
recode bwt (min/2500=1) ( /max=0), gen(lowbwt)
(189 differences between bwt and lowbwt)
label variable lowbwt "birth weight"
label define lowbwt 1 "1,birth weight<2500g" 0 "0,birth weight>=2500"
label value lowbwt lowbwt

* Covariables
************
* Age groups
recode age (min/19=1) (20/23=2) (24/26=3) (27/max=4) (.=.), gen(age4)
(189 differences between age and age4)
save, replace
file E:\Arbeit\Greifswald\ship_statistik\Richtlinie zur Durchführung von Auswertungen\data\\bw_smoking.dta saved
log close

The second log file below illustrates how to run a linear (least squares) regression and check its assumptions.

2. Log file for the analysis

** Sebastian Baumeister
** Title of project: Maternal correlates of birth weight
** Path: E:\Arbeit\Greifswald\ship_statistik\Richtlinie zur Durchführung von Auswertungen\
glo bw "E:\Arbeit\Greifswald\ship_statistik\Richtlinie zur Durchführung von Auswertungen\data\"
use "$bw\bw_smoking.dta", clear
(Hosmer & Lemeshow data)
des, sh
Contains data from E:\Arbeit\Greifswald\ship_statistik\Richtlinie zur Durchführung von Auswertungen\data\\bw_smoking.dta
obs: 189 Hosmer & Lemeshow data

vars: Jan :15
size: 5,292
Sorted by:

* Analytical sample
***********************
glo out bwt
glo cov age lwt race smoke ptl ht ui ftv
tabmiss $out $cov
[Output: columns Variable, Obs, Missings, FeqMissings, NonMiss, FeqNonMiss for bwt, age, lwt, race, smoke, ptl, ht, ui, ftv; values not preserved in this transcript]
egen nm=rowmiss($out $cov)
fre nm
[Output: frequency table of nm; values not preserved]
recode nm (0=0) (1/max=1),gen(miss)
(0 differences between nm and miss)
* No missing values on any variable included in the regression analyses

* Table 1: Characteristics of the mothers
************************
tabstat age lwt ptl ftv if miss==0, s(n median p25 p75 mean sd) c(s)
[Output: N, p50, p25, p75, mean, sd for age, lwt, ptl, ftv; values not preserved]
fre race smoke ht ui if miss==0
[Output: frequency tables for race (white, black, other), smoke (smoked during pregnancy), ht (has history of hypertension) and ui (presence, uterine irritability); values not preserved]

* Table 2: Correlates of birth weight
************************
reg bwt age i.race i.smoke i.ht lwt ptl ftv if miss==0
[Output: ANOVA table and coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for age, race, smoke, ht, lwt, ptl, ftv, _cons; F(8, 180) = 5.09; other values not preserved]

* Test of linear regression (OLS) assumptions
**************************
* Normality of residuals: looks fine
*****
predict r, res
kdensity r, norm
qnorm r
pnorm r

* Homoscedasticity: looks fine
*****
estat hettest age i.race i.smoke i.ht lwt ptl ftv, iid
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: age i.race i.smoke i.ht lwt ptl ftv
chi2(8) = 7.45
Prob > chi2 = [value not preserved]
rvfplot, yline(0)

* Linearity
*****
* Age - looks nonlinear, modeled using restricted cubic splines
twoway (scatter bwt age) (lfit bwt age) (lowess bwt age) (fpfit bwt age)
acprplot age, lowess lsopts(bwidth(1))

centile age if miss==0, c( )
[Output: centiles of age with binomial-interpolation 95% confidence intervals; values not preserved in this transcript]
mkspline2 ag_=age, cubic knots( )
dis knot1 knot2 knot3 knot age

* Weight at last menstrual period - looks linear
twoway (scatter bwt lwt) (lfit bwt lwt) (lowess bwt lwt) /*(fpfit bwt lwt)*/

* Premature labor history
twoway (scatter bwt ptl) (lfit bwt ptl) (lowess bwt ptl) /*(fpfit bwt lwt)*/
fre ptl
[Output: frequency table of ptl, premature labor history (count); values not preserved]
* Only few subjects with premature labor history >1, therefore recode into a new variable with values 0 and 1+
recode ptl (0=0 "0,zero") (1/3=1 "1,1+"), gen(plt_2c)
(6 differences between ptl and plt_2c)

* Collinearity - not an issue
*****
vif
[Output: VIF and 1/VIF for age, race, smoke, ht, lwt, ptl, ftv; Mean VIF 1.17; other values not preserved]
collin bwt age race smoke ht lwt plt_2c ftv if miss==0
(obs=189)
Collinearity Diagnostics
[Output: VIF, SQRT VIF, Tolerance, R-Squared for bwt, age, race, smoke, ht, lwt, plt_2c, ftv; Mean VIF 1.18; eigenvalues, condition indices, condition number and Det(correlation matrix); other values not preserved]
Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)

* Omitted variables / exogeneity -> not suspected
*****
linktest
[Output: regression of bwt on _hat, _hatsq, _cons; F(2, 186) = 21.23; other values not preserved]
estat ovtest
Ramsey RESET test using powers of the fitted values of bwt
Ho: model has no omitted variables
F(3, 177) = 1.49
Prob > F = [value not preserved]

* Influential observations
*****
* Cook's distance > 4/N
reg bwt ag_* i.race i.smoke i.ht lwt ptl ftv if miss==0
[Output: ANOVA table and coefficient table for the ag_ spline terms, race, smoke, ht, lwt, ptl, ftv, _cons; F(10, 178) = 4.64; other values not preserved]

predict d, cooksd
li bwt age race smoke ht lwt plt_2c ftv if d>4/
[Output: 12 observations listed; values not preserved in this transcript]

* dfits > 2*sqrt(k/N)
predict dfits, dfits
quiet scalar thresh=2*sqrt((e(df_m)+1)/e(N))
di "dfits threshold=" %6.3f thresh
dfits threshold= 0.482
li dfits bwt age race smoke ht lwt plt_2c ftv if abs(dfits)>2*thresh & e(sample)
[Output: 3 observations listed; values not preserved]

reg bwt ag_* i.race i.smoke i.ht lwt ptl ftv if miss==0 & d<4/189
[Output: ANOVA table and coefficient table for the ag_ spline terms, race, smoke, ht, lwt, ptl, ftv, _cons; F(10, 166) = 4.90; other values not preserved]
reg bwt ag_* i.race i.smoke i.ht lwt ptl ftv if miss==0 & abs(dfits)<2*thresh
[Output: ANOVA table and coefficient table for the same terms; F(10, 175) = 4.14; other values not preserved]

* Some influential observations distort the estimates
* Robust and quantile regression
rreg bwt ag_* i.race i.smoke i.ht lwt ptl ftv
Robust regression
Number of obs = 189
F(10, 178) = 4.44
[Output: coefficient table for the ag_ spline terms, race, smoke, ht, lwt, ptl, ftv, _cons; values not preserved]

xi: qreg bwt ag_* i.race i.smoke i.ht lwt ptl ftv
i.race _Irace_1-3 (naturally coded; _Irace_1 omitted)
i.smoke _Ismoke_0-1 (naturally coded; _Ismoke_0 omitted)
i.ht _Iht_0-1 (naturally coded; _Iht_0 omitted)
Median regression
Number of obs = 189
Raw sum of deviations (about 2977)
[Output: iteration log, minimum sum of deviations, pseudo R2 and coefficient table for the ag_ spline terms, _Irace_ and _Ismoke_ and _Iht_ indicators, lwt, ptl, ftv, _cons; values not preserved]
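The two influence cut-offs used in this part of the log, Cook's distance above 4/N and DFITS above 2*sqrt(k/N), can be tried in isolation. The following is a minimal sketch, not the authors' code: it assumes Stata's shipped auto.dta as a stand-in dataset, and the model and the variable names cook, dfit and flag are purely illustrative. For the birth weight example, N = 189 and k = 11 estimated parameters give 4/189 ≈ 0.021 and 2*sqrt(11/189) ≈ 0.482. Note that the log above lists and excludes observations at 2*thresh, that is, at twice the conventional DFITS cut-off.

```stata
* Sketch only: auto.dta and this model are stand-ins, not the SHIP data
sysuse auto, clear
regress price mpg weight foreign

* Influence measures from the fitted OLS model
predict double cook, cooksd
predict double dfit, dfits

* Conventional cut-offs: 4/N for Cook's distance,
* 2*sqrt(k/N) for DFITS, with k = model df + 1 (intercept)
scalar cook_cut  = 4/e(N)
scalar dfits_cut = 2*sqrt((e(df_m) + 1)/e(N))
display "Cook's D cut-off = " %6.4f scalar(cook_cut)
display "DFITS cut-off    = " %6.4f scalar(dfits_cut)

* Flag and list observations exceeding either cut-off
generate byte flag = (cook > scalar(cook_cut) | abs(dfit) > scalar(dfits_cut)) if e(sample)
list make cook dfit if flag == 1
```

Refitting the model with `if flag == 0`, as the log does with its own conditions, shows how strongly the flagged observations drive the estimates.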

* Final robust regression model
xi: rreg bwt ag_* i.race i.smoke i.ht lwt i.plt_2c ftv if miss==0
i.race _Irace_1-3 (naturally coded; _Irace_1 omitted)
i.smoke _Ismoke_0-1 (naturally coded; _Ismoke_0 omitted)
i.ht _Iht_0-1 (naturally coded; _Iht_0 omitted)
i.plt_2c _Iplt_2c_0-1 (naturally coded; _Iplt_2c_0 omitted)
Robust regression
Number of obs = 189
F(10, 178) = 4.79
[Output: coefficient table for the ag_ spline terms, _Irace_ and _Ismoke_ and _Iht_ indicators, lwt, _Iplt_2c_ indicator, ftv, _cons; values not preserved in this transcript]

testparm ag_*
( 1) ag_1 = 0
( 2) ag_2 = 0
( 3) ag_3 = 0
F(3, 178) = 1.52
Prob > F = [value not preserved]

log close
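The empirical change-in-coefficient criterion mentioned in the notes on covariate selection is not demonstrated in the log files. A minimal sketch of one way to apply it is given here under stated assumptions: auto.dta is a stand-in dataset, mpg plays the role of the exposure and weight the role of the candidate confounder, and the change is measured relative to the adjusted estimate (other conventions divide by the crude estimate). The 15% threshold is the one named in the guideline.

```stata
* Sketch only: dataset and variable roles are illustrative assumptions
sysuse auto, clear

* Crude model: exposure only
quietly regress price mpg
scalar b_crude = _b[mpg]

* Adjusted model: exposure plus one candidate confounder
quietly regress price mpg weight
scalar b_adj = _b[mpg]

* Relative change of the exposure coefficient
* (assumed convention: relative to the adjusted estimate)
scalar rel_change = abs((scalar(b_crude) - scalar(b_adj))/scalar(b_adj))
display "crude b = " %9.3f scalar(b_crude) "   adjusted b = " %9.3f scalar(b_adj)
display "relative change = " %6.3f scalar(rel_change)

* Apply the 15% criterion
display cond(scalar(rel_change) > 0.15, "retain weight as a confounder", "change does not exceed 15%")
```

With several candidate confounders, the same comparison would be repeated per candidate. Both regressions should be fitted on the same analytical sample, which the log files ensure through the `miss==0` condition.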